require(plotly)
require(hierfstat)
require(adegenet)
require(PopGenReport)
require(pheatmap)
require(tidyverse)
require(DiagrammeR)
require(poppr)
require(genepop)
require(graph4lg)
require(knitr)
require(pegas)
require(cowplot)
require(magrittr)

Readme

This is an rstudio project. If you’d like to pre-rendered figures, read a summary of analysis and view code, please open the html file in a browser.

To conduct the analyses on your computer, edit or run code: clone this repository into a directory on you r local machine and open the .Rproj file in Rstudio. All data and analyses are available in the github repository at https://github.com/david-dayan/rogue_half_pounder.git

Rationale

In addition to winter early-summer and late-summer runs, some steelhead on the Rogue express a “half-pounder” life-history. Half-pounders exhibit a false-spawning run (although some precocious males may spawn) as juveniles during the summer months, then return to sea to continue the marine growth phase. There is great interest in half-pounders from a management perspective, because (i) half pounder abundance should integrate juvenile freshwater and early ocean conditions for steelhead regardless of whether they express the half pounder phenotype (ii) half pounder abundance is predictive of steelhead abundance in historical dam passage counts, and (iii) half-pounders are a unique fishery of great cultural and economic value. However, the relative proportion of winter vs early summer vs late-summer run life histories expressed by half pounders later in life is unknown. Furthermore there is limited genetic information about the three adult runs of steelhead on the Rogue, so improving our understanding of the relationships between these runs is key to our research objectives regarding half-pounders.

This study attempts to use neutral and adaptive GTseq markers to examine population structure and run timing among Rogue River steelhead. We also attempt to classify half-pounders into run-timing groups on the basis of run-timing associated genetic markers.

Data Summary

Samples

Half Pounders
Samples were collected during ODFW’s Lower Rogue Seining Project in 2018 and 2019. The Lower Rogue Seining Project estimates escapement for Coho, late-summer run O. mykiss, half-pounder O. mykiss and fall chinook by beach seining near Huntley Park at approximate river mile 8, three times weekly from July through October. Half-pounder O. mykiss were identified as individuals with fork length 250 - 410mm and sampled in batches of up to 50 fish each day for 11 days from September 7th to October 1st 2018 totaling 384 individuals and 18 days from August 14th to September 25th 2019 totaling 331 individuals. Caudal fin clips were taken for DNA extraction, and placed in daily batch vials containing 95% ethanol. Note that due to batch collection of fin clips, that these numbers are inflated (some individuals have multiple tissue samples - identified later and filtered)

All half-pounders are non-marked and assumed to be natural origin fish.

Also ~5% of samples represented twice in GTseq library as QAQC samples

Adults
For brevity and easy code comprehension, throughout the notebook early-summer run steelhead are referred to as summer run and late-summer run steelhead are referred to as fall run.

50 winter and 45 early-run summer run fish. Adult summer steelhead were sampled at the Cole River Hatchery sorting pond on 6/26/2019 and (b) adult winter steelhead were sampled at Cole Rivers Hatchery (Rogue River) and the Applegate River from adult brood stock for 2019. 166 late-summer run fish (fall run) were sampled at the Huntley Park seine on the lower Rogue in 2019 and an additional 119 late-summer fish were collected at Huntley in 2020.

All but 4 early-summer run adults are marked and assumed to hatchery origin fish. The final filtered dataset contained 1 natural origin and 41 hatchery origin fish. Winter run fish include NOR and HOR fish. After filtering the final dataset contains 18 HOR and 22 NOR fish. Early summer run fish are all unmarked and assumed NOR. All late-summer fish were unmarked and assumed NOR.

Genotype Data

Information about sequencing data, genotype calling and filtering is available in the relevant R notebook (2021_genotyping_notebook.Rmd) in this repository.

load("./genotype_data_2021/genind_2.0.R")
load("./genotype_data_2021/genotypes_2.2.R")
genos_2.3 <- genos_2.2
genind_2.1 <- genind_2.0
genos_2.3 <- ungroup(genos_2.3)
kable(genos_2.3 %>%
  group_by(run, year) %>%
  tally(), caption = "Final Filtered Dataset")
Final Filtered Dataset
run year n
fall 2019 157
fall 2020 118
halfpounder 2018 338
halfpounder 2019 305
Summer 2019 42
Winter 2019 40
#save a file with this info
progeny <- readxl::read_xlsx("../meta_data/2019_summer/Omy Rogue2019 steelhead datasheets.xlsx", sheet = 4) 
progeny <- progeny %>%
  select(SFGL_ID, Origin)
select(genos_2.3, Sample, run, year, Date) %>%
  left_join(progeny, by = c("Sample" = "SFGL_ID")) %>%
  mutate(Origin = replace(Origin, Origin == "AD", "HOR")) %>%
  mutate(Origin = replace(Origin, Origin == "1", "NOR")) %>%
  replace_na(list(Origin = "NOR")) %>%
  write_tsv("supplemental_files_for_ms/sample_metadata_2021.txt")#convert AD (HOR) and 1 (NOR)

Marker info

Throughout the notebook we refer to subsets of markers based on annotations provided by CRITFC. To my knowledge these annotations come from decisions made when the panel was assembled and edited (i.e. a run timing associated marker came out of a GWAS fro run timing). Broadly we examine 4 subsets of markers:

  1. All markers: all 390 GTseq markers available in the full SFGL O mykiss GTseq panel. (The dataset here has fewer markers due to filtering - see genotyping notebook) (353 markers)
  2. Neutral Markers: Any markers annotated as “neutral,” with no other annotations. From my understanding these were mostly selected from broad spatial scale RADseq studies that refelct broad scale geographic structure and may or may not be in linkage with unkownn adaptive loci.
  3. Adaptive Markers: any markers with a known adaptive functional annotation.
  4. Run-timing markers: a subset of adaptive markers with specific associations with run-timing in O. mykiss.

The table below demonstrates the annotations

marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)

marker_summary <- marker_mapping %>%
  mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>%
  filter(marker %in% colnames(genos_2.3)) %>%
  mutate(neutral = if_else(str_detect(`Presumed Type`, 'neutral|Neutral'), "neutral", "adaptive")) %>%
  mutate(run_timing = if_else(str_detect(`Presumed Type`, 'run|Run'), "run_timing", "non-run_timing")) %>%
  dplyr::select(marker, chr, start, 'Presumed Type', neutral, run_timing, Source)

DT::datatable(marker_summary, options = list(pageLength=10))

Workflow

Our primary goals:
(a) examine population structure within and among the three run timing groups in the Rogue River (early summer, late summer and winter runs) (late summer referred to as fall in notebook)
(b) classify half-pounders into run timing group.

To begin, we will collect nucleotide diversity index at the marker and population levels, examine population structure at neutral and adaptive markers using multivariate and bayesian approaches, and assign half pounders to run timing groups based on a genetic classifier.

Diversity Metrics

Here we calculate both per-marker and per-population diversity metrics (Ho He HWE Fis and Fst)

Heterozygosity

Heterogyzity metrics

# first Ho and He
n.pop <- seppop(genind_2.1)

hobs <- lapply(n.pop, function(x) (summary(x)$Hobs))
hobs <- as.data.frame(t(do.call(rbind, hobs)))
hobs <- hobs %>%
  rownames_to_column(var="marker")
hobs <- hobs %>%
  pivot_longer(-marker, names_to = "run", values_to = "Ho")
#ggplot(hobs)+geom_boxplot(aes(x=pop, y=hobs))+theme_classic()+xlab("Run Timing")+ylab("Observed Heterozygosity")

hexp <- lapply(n.pop, function(x) (summary(x)$Hexp))
hexp <- as.data.frame(t(do.call(rbind, hexp)))
hexp <- hexp %>%
  rownames_to_column(var="marker") %>%
  pivot_longer(-marker, names_to = "run", values_to = "He")

marker_divs <- hexp %>%
  left_join(hobs)

#now lets add the annotation status (neutral vs adaptive)

marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)

marker_divs <- marker_mapping %>%
  select(marker, `Presumed Type`) %>%
  mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>% #marker name convention is different
  mutate(marker = str_replace(marker, "\\.", "_")) %>% 
  mutate(neutral = if_else(str_detect(`Presumed Type`, 'Adaptive|adaptive'), "adaptive", "neutral")) %>%
  right_join(marker_divs)

# lets add known run-timing marker as a grouping variable, and conver to longer format
marker_divs <-  marker_divs %>%
  mutate(run_timing_marker = if_else(str_detect(`Presumed Type`, 'Run|run'), "run" , "non-run")) %>%
   pivot_longer(c("Ho", "He"), names_to = "obs_exp", values_to ="H")

#nice, now lets make a long table


# lets throw up some nice plots here
# first for all markers
ggplot(data=marker_divs)+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("all markers")

ggplot(data=marker_divs[marker_divs$neutral=="neutral",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("neutral markers")

ggplot(data=marker_divs[marker_divs$neutral=="adaptive",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("adaptive markers")

ggplot(data=marker_divs[marker_divs$run_timing_marker=="run",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("run-timing markers")

#publication plot fig. 1
#ggplot(data=drop_na(marker_divs[marker_divs$run_timing_marker=="run",]))+geom_boxplot(aes(x=run, y=H, fill = obs_exp), alpha = 0.8)+scale_fill_viridis_d(labels = c(expression(H[E]), expression(H[O])), name = " ")+theme_classic()+xlab("")+scale_x_discrete(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme(axis.text.x = element_text(size = 10, angle = 90))

# publication plot suppl fig. 1
# a <- ggplot(data=drop_na(marker_divs[marker_divs$neutral=="neutral",]))+geom_boxplot(aes(x=run, y=H, fill = obs_exp), alpha = 0.8)+scale_fill_viridis_d(labels = c(expression(H[E]), expression(H[O])), name = " ")+theme_classic()+xlab("")+scale_x_discrete(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme(legend.position = "none")+theme(axis.text.x = element_text(size = 10, angle = 90))
# 
# b <- ggplot(data=drop_na(marker_divs[marker_divs$neutral=="adaptive",]))+geom_boxplot(aes(x=run, y=H, fill = obs_exp), alpha = 0.8)+scale_fill_viridis_d(labels = c(expression(H[E]), expression(H[O])), name = " ")+theme_classic()+xlab("")+scale_x_discrete(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+ylim(0,0.68)+theme(axis.text.x = element_text(size = 10, angle = 90))
# 
# plot_grid(a,b, rel_widths = c(0.83,1), labels = c("(a)", "(b)"))

#supplemental table
marker_divs %>%
  group_by(neutral, run, obs_exp) %>%
  summarise(mean = mean(H)) %>%
  print(n = 24)
## # A tibble: 16 × 4
## # Groups:   neutral, run [8]
##    neutral  run         obs_exp  mean
##    <chr>    <chr>       <chr>   <dbl>
##  1 adaptive fall        He      0.259
##  2 adaptive fall        Ho      0.251
##  3 adaptive halfpounder He      0.262
##  4 adaptive halfpounder Ho      0.241
##  5 adaptive Summer      He      0.223
##  6 adaptive Summer      Ho      0.227
##  7 adaptive Winter      He      0.238
##  8 adaptive Winter      Ho      0.232
##  9 neutral  fall        He      0.307
## 10 neutral  fall        Ho      0.301
## 11 neutral  halfpounder He      0.308
## 12 neutral  halfpounder Ho      0.299
## 13 neutral  Summer      He      0.304
## 14 neutral  Summer      Ho      0.300
## 15 neutral  Winter      He      0.310
## 16 neutral  Winter      Ho      0.305
marker_divs %>%
  group_by(run_timing_marker, run, obs_exp) %>%
  summarise(mean = mean(H)) %>%
  print(n = 24)
## # A tibble: 16 × 4
## # Groups:   run_timing_marker, run [8]
##    run_timing_marker run         obs_exp    mean
##    <chr>             <chr>       <chr>     <dbl>
##  1 non-run           fall        He      0.290  
##  2 non-run           fall        Ho      0.281  
##  3 non-run           halfpounder He      0.292  
##  4 non-run           halfpounder Ho      0.281  
##  5 non-run           Summer      He      0.289  
##  6 non-run           Summer      Ho      0.286  
##  7 non-run           Winter      He      0.294  
##  8 non-run           Winter      Ho      0.289  
##  9 run               fall        He      0.355  
## 10 run               fall        Ho      0.387  
## 11 run               halfpounder He      0.357  
## 12 run               halfpounder Ho      0.266  
## 13 run               Summer      He      0.00933
## 14 run               Summer      Ho      0.00992
## 15 run               Winter      He      0.0960 
## 16 run               Winter      Ho      0.102

Significance Testing

Are these differences between populations significant? What about differences between nuetral and adaptive sets of loci. To test with the same number of loci (across populations) we’ll use a monte-carlo test and 999 permutations. To test across different sets of loci (neutral vs adaptive), we’ll use a simple t-test

#lets make a way to easily called different snp classes 
#"neutral" markers 

neutral_loci_names <- marker_mapping %>%
  filter(str_detect(`Presumed Type`, 'neutral|Neutral')) %>%
  dplyr::select(marker)

#different naming convention, lets fix
neutral_loci_names <- str_replace(neutral_loci_names$marker, "Omy28", "Chr28")

#run timing markers
run_timing_loci_names <- marker_mapping %>%
  filter(chr == "28" | CRITFC_chromosome == "28") %>%
  filter(str_detect(`Presumed Type`, 'run|Run')) %>%
  select(marker)

#different naming convention, lets fix
run_timing_loci_names <- str_replace(run_timing_loci_names$marker, "Omy28", "Chr28")

## test for all pairwise differences at run-timing markers
#make a pop separted genind at run timing markers
chr28_seppop  <- seppop(genind_2.1[loc=run_timing_loci_names])


# fall-half
a <- Hs.test(chr28_seppop$fall, chr28_seppop$halfpounder)

#fall-summer
b <- Hs.test(chr28_seppop$fall, chr28_seppop$Summer)

#fall-winter
c <- Hs.test(chr28_seppop$fall, chr28_seppop$Winter)

#half-summer
d <- Hs.test(chr28_seppop$halfpounder, chr28_seppop$Summer)

#half-winter
e <- Hs.test(chr28_seppop$halfpounder, chr28_seppop$Winter)

#summer-winter
f <- Hs.test(chr28_seppop$Summer, chr28_seppop$Winter)

#now make a nice table
df <- as.data.frame(cbind(c("fall", "fall", "fall", "half", "half", "summer"), c("half", "summer", "winter", "summer", "winter", "winter"), c(a$pvalue,b$pvalue,c$pvalue,d$pvalue,e$pvalue,f$pvalue)))
colnames(df) <- c("pop1", "pop2","p-value")
kable(df, caption = "Monte-Carlo p-value for difference in observed Heterozygosity at run-timing markers")
Monte-Carlo p-value for difference in observed Heterozygosity at run-timing markers
pop1 pop2 p-value
fall half 0.779
fall summer 0.001
fall winter 0.001
half summer 0.001
half winter 0.001
summer winter 0.001
## Next lets look if there is increased/decreased diversity within a population at these markers compared to neutral
neutral_seppop  <- seppop(genind_2.1[loc=neutral_loci_names])


#fall
a <- t.test(summary(chr28_seppop$fall)$Hexp, summary(neutral_seppop$fall)$Hexp, alternative = "greater")

#half
b <- t.test(summary(chr28_seppop$halfpounder)$Hexp, summary(neutral_seppop$halfpounder)$Hexp, alternative = "less")


#summer
c <- t.test(summary(chr28_seppop$Summer)$Hexp, summary(neutral_seppop$Summer)$Hexp, alternative = "less")

#winter
d <- t.test(summary(chr28_seppop$Winter)$Hexp, summary(neutral_seppop$Winter)$Hexp,  alternative = "less")

#now make a nice table
df <- as.data.frame(cbind(c("fall", "half", "summer", "winter"),c(a$p.value,b$p.value,c$p.value,d$p.value)))
colnames(df) <- c("pop", "p-val")
kable(df, caption = "T-test p-value for difference He at run timing markers vs neutral markers")
T-test p-value for difference He at run timing markers vs neutral markers
pop p-val
fall 0.204800241608908
half 0.809754645532412
summer 4.78747565212889e-28
winter 0.000167292119122526

All pairwise population comparisons of He at chr28 are significantly different (p=0.001) EXCEPT fall-halfpounder. Winter and Summer populations demonstrate significantly reduced He at chr28 markers relative to neutral markers.

Are there significant departures from HWE at the loci level?

# here we use the hw.test function from pegas (exact test based on Monte Carlo permutations of alleles, 1000 permutations)
HWE.test <- lapply(n.pop, function(x) hw.test(x, B=1000))
# here we take the list of dataframes of p-values and combine into a single dataframe
hwe <- reduce(HWE.test, cbind)
hwe <- hwe[,c(4,8,12,16)]
colnames(hwe) <- c("fall", "halfpounder", "summer", "winter")
hwe <- as.data.frame(hwe)

# next we correct for multiple comparisons
p.adj <- as.data.frame(apply(hwe, MARGIN = 2, function(x) p.adjust(x, "fdr")))
hwe_exceed <- p.adj %>% rownames_to_column(var="marker") %>%
  pivot_longer(-marker, names_to = "run", values_to = "fdr") %>%
  filter(fdr < 0.1)

hwe_exceed <- hwe_exceed %>%
  left_join(pivot_wider(marker_divs, names_from = obs_exp, values_from = H)) %>%  
  mutate(direction = if_else(He > Ho, "excess_homo", "excess_hetero")) %>%
  filter(direction == "excess_homo")

a <- hwe_exceed%>%
  group_by(run) %>%
  tally()

kable(a, caption = "Number of markers significantly out of HWE")
Number of markers significantly out of HWE
run n
fall 33
halfpounder 58

Yes, see table above for number of SNPs outside of HWE (fdr < 0.1) per “population” Note that this may differ from the publication numbers because this notebook rendering was not the one used for final results (e.g. other runs of the monte-carlo simulations may produce slighlty different results)

#lets check if there is an enrichment of run-timing and adaptive markers in markers out of HWE in fall and half-pounders
  

#lets get marker info, but only for markers in the panel
marker_mapping2 <- marker_mapping %>%
  mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>%
  filter(marker %in% colnames(genos_2.3))

# halfpunder enriched for run timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b

hr <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")

#fall for run-timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b

fr <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")

#halfpounder for adaptive markers
a <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$neutral=="adaptive",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$neutral!="adaptive",])
d <- nrow(marker_mapping2)-b

ha <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")

#fall for adaptive markers
a <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$neutral=="adaptive",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$neutral!="adaptive",])
d <- nrow(marker_mapping2)-b

fa <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")



# summer enriched for run timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="Summer" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="Summer" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b

sr <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")

#make a nice table
df <- as.data.frame(cbind(c("fall", "fall", "half", "half"),c("run", "all adaptive", "run", "all adaptive"),c(fr$p.value, fa$p.value, hr$p.value, ha$p.value), c(fr$estimate, fa$estimate, hr$estimate, ha$estimate)))
colnames(df) <- c("pop", "marker type", "p-val", "fold-enrichment")
kable(df, caption = "enrichment (fisher's exact test) of marker types among markers out of HWE")
enrichment (fisher’s exact test) of marker types among markers out of HWE
pop marker type p-val fold-enrichment
odds.ratio fall run 5.27646052659307e-11 0
odds.ratio.1 fall all adaptive 0.238093888963049 0.726210991591334
odds.ratio.2 half run 2.0324732994515e-09 0.140696878947655
odds.ratio.3 half all adaptive 0.340572491209925 0.860367706319913

Markers significantly out of HWE (Ho < He: excess of homozygotes) are enriched for run-timing markers in half-pounders only.

Now let’s make a nice visual representation of this information (half pounder demonstrate a dearth of heterozygotes at run timing markers)

plot_data_half <- marker_divs %>%
  filter(run == "halfpounder") %>%
  pivot_wider( names_from = obs_exp, values_from = H)

ggplot(plot_data_half)+geom_point(aes(He, Ho, color = run_timing_marker))+geom_abline(aes(intercept=0, slope=1))+coord_equal(ratio=1)+ylim(0,0.6)+scale_color_viridis_d()+theme_classic()+ggtitle("halfpounder")

plot_data_fall <- marker_divs %>%
  filter(run == "fall") %>%
  pivot_wider( names_from = obs_exp, values_from = H)

ggplot(plot_data_fall)+geom_point(aes(He, Ho, color = run_timing_marker))+geom_abline(aes(intercept=0, slope=1))+coord_equal(ratio=1)+ylim(0,0.6)+scale_color_viridis_d()+theme_classic()+ggtitle("fall")

The results in halfpounders are very clear.

Heterozygosity results summary

At neutral markers, there is a slight reduction of observed from expected heterozygosity, most pronounced among summer run fish. At run timing markers, there is a significant reduction in diversity (He) relative to neutral markers among the summer run and winter run fish (one-sided t-test), and increased diversity at fall run and half-pounders (not significant). Furthermore, there is an excess of homozygosity at run timing associated SNPs in the half-pounders (but not fall fish) suggesting some structure within this group. Among markers with significant departures from HWE within half-pounders, there is a significant enrichment of run-timing associated markers, but not significant enrichment for markers annotated as adaptive overall. Fall run fish also had some markers out of HWE, including both neutral and adaptive markers.

Fst

Let’s move on to f-statistics. We’ll use hierfstats for most of this work, so the first step is to convert to a hierfstat format. Then we’ll calculate some basic statists and Fst (Nei)

fstat <- genind2hierfstat(genind_2.1)
colnames(fstat) <- c(pop, names(genind_2.1$loc.n.all))
# and also lets make datasets at different sets of loci, because hierfstat doesn't easily retain loci names for later use
fstat_neutral <- genind2hierfstat(genind_2.1[loc=neutral_loci_names])
colnames(fstat_neutral) <- c(pop, names(genind_2.1[loc=neutral_loci_names]$loc.n.all))
fstat_run_timing <- genind2hierfstat(genind_2.1[loc=run_timing_loci_names])
colnames(fstat_run_timing) <- c(pop, names(genind_2.1[loc=run_timing_loci_names]$loc.n.all))

#calculate datset wide basic stats
basicstats <- basic.stats(fstat)
basicstats_neutral <- basic.stats(fstat_neutral)
basicstats_run_timing <- basic.stats(fstat_run_timing)

kable(basicstats$overall, caption = "All markers Fstats")
All markers Fstats
x
Ho 0.2812
Hs 0.2901
Ht 0.2974
Dst 0.0073
Htp 0.2998
Dstp 0.0098
Fst 0.0247
Fstp 0.0326
Fis 0.0306
Dest 0.0138
kable(basicstats_neutral$overall, caption = "Neutral marker Fstats")
Neutral marker Fstats
x
Ho 0.3015
Hs 0.3109
Ht 0.3123
Dst 0.0014
Htp 0.3127
Dstp 0.0019
Fst 0.0044
Fstp 0.0059
Fis 0.0301
Dest 0.0027
kable(basicstats_run_timing$overall, caption = "Run Timing Marker Fstats")
Run Timing Marker Fstats
x
Ho 0.1911
Hs 0.2059
Ht 0.3739
Dst 0.1680
Htp 0.4300
Dstp 0.2240
Fst 0.4493
Fstp 0.5210
Fis 0.0718
Dest 0.2821

Marker level Fst

Now let’s look at the distribution of Fst and check if markers with certain annotations are enriched among outliers.

basicstats$perloc$run_timing <- if_else(rownames(basicstats$perloc) %in% run_timing_loci_names, "run", "non-run")
basicstats$perloc$neutral <- if_else(rownames(basicstats$perloc) %in% neutral_loci_names, "neutral", "adaptive")

ggplot(basicstats$perloc)+geom_histogram(aes(x=Fst, fill=neutral))+theme_classic()

ggplot(basicstats$perloc)+geom_histogram(aes(x=Fst, fill=run_timing))+theme_classic()

#check if any non-run timing markers have high Fst
#max((basicstats$perloc[basicstats$perloc$run_timing=="non-run",])$Fst, na.rm = TRUE)

Yes, there are some clear fst outlier loci in the dataset. These outliers are composed solely of known run-timing markers, but not all run-timing markers demostrate high differentiation. The maximum fst on a non-run-timing marker was 0.0484

Pairwise Population Fst

Let’s collect genetic distance info on pairs of pops as well (Weir and Cochran 1984).

First, the full dataset:

genet.dist(fstat, method="WC84")
##                    fall halfpounder      Summer
## halfpounder 0.001201297                        
## Summer      0.038062979 0.037294054            
## Winter      0.018717941 0.013828218 0.082337046

Then just neutral loci:

genet.dist(fstat_neutral, method="WC84")
##                     fall  halfpounder       Summer
## halfpounder 0.0001593717                          
## Summer      0.0098611398 0.0084186634             
## Winter      0.0035831749 0.0028639667 0.0106229743

Let’s split the year classes and run again

First for full dataset:

genind_year <- genind_2.1

genind_year@pop <- as.factor(paste(genos_2.3$run, genos_2.3$year, sep = "_"))

fstat_year <- genind2hierfstat(genind_year)
colnames(fstat_year) <- c(pop, names(genind_year$loc.n.all))

Then for neutral only:

# and also lets make datasets at different sets of loci, because hierfstat doesn't easily retain loci names for later use
fstat_neutral_year <- genind2hierfstat(genind_year[loc=neutral_loci_names])
colnames(fstat_neutral_year) <- c(pop, names(genind_year[loc=neutral_loci_names]$loc.n.all))


genet.dist(fstat_year, method="WC84")
##                     fall_2019    fall_2020 halfpounder_2018 halfpounder_2019
## fall_2020        0.0003375932                                               
## halfpounder_2018 0.0019386041 0.0018140299                                  
## halfpounder_2019 0.0010113389 0.0015583954     0.0012435492                 
## Summer_2019      0.0372840328 0.0396714018     0.0412151558     0.0337328288
## Winter_2019      0.0188688247 0.0187409258     0.0110502287     0.0175765503
##                   Summer_2019
## fall_2020                    
## halfpounder_2018             
## halfpounder_2019             
## Summer_2019                  
## Winter_2019      0.0823370456
genet.dist(fstat_neutral_year, method="WC84")
##                     fall_2019    fall_2020 halfpounder_2018 halfpounder_2019
## fall_2020        0.0007798158                                               
## halfpounder_2018 0.0003828930 0.0006248472                                  
## halfpounder_2019 0.0002092204 0.0009664689     0.0006479603                 
## Summer_2019      0.0092590322 0.0110953922     0.0082973021     0.0088814430
## Winter_2019      0.0035084388 0.0040894856     0.0023192311     0.0037796054
##                   Summer_2019
## fall_2020                    
## halfpounder_2018             
## halfpounder_2019             
## Summer_2019                  
## Winter_2019      0.0106229743
#this was for Fst sig testing, but decided to skip this

mat.obs <- pairwise.WCfst(as.data.frame(cbind((genind_2.1$pop),genind_2.1$tab)))

NBPERM <- 100 # this is the number of permutations used for the p-values; for a publication at least 999 would be required.
mat.perm <- lapply(1:NBPERM, function(i) pairwise.WCfst(as.data.frame(cbind(sample(genind_2.1$pop, size = 882),genind_2.1$tab))))


allTests <- list()
 for(i in 1:(nrow(mat.obs)-1)){
   for(j in 2:nrow(mat.obs)){
   allTests[[paste(rownames(mat.obs)[i],rownames(mat.obs)[j],sep="-")]] <- as.randtest(na.omit(sapply(1:NBPERM, function(k) mat.perm[[k]][i,j])), mat.obs[i,j], alter="greater")
   }
}

Fst Results Summary

Moderate differentiation between most pairwise pop comps (not fall and halfpounder), but as suspected, this is driven mostly by the run timing markers. When examining only neutral annotated loci, differentiation is extremely low (less than 1%). Also of note here is that both the neutral and total estimates of differentiation between early and late summer (fall) runs is quite high. Mean Fst over all loci is about half the level of differentation between early summer and winter runs and nearly the same when examining only neutral loci. Indeed, the late summer run fish demonstrate lower differentiation from winter run than early summer run fish at both neutral and all loci.

Population Structure

From the Fst estimation in the section above we have our first ideas about population structure: (a) fall run and half pounder fish demonstrate little to no differentiation, and this group is less differentiated from winter run than summer run fish and (b) some evidence of structure WITHIN halfpounder and fall runs. In this section we will conduct several analyses to uncover population structure in greater detail.

STRUCTURE

Here we use a bayesian, model based clustering algorithm (STRUCTURE) to infer population structure and estimate admixture proportions of individual samples.

First we need to get our dataset ready for structure: remove linked loci, convert to structure format.

# first lets calculate LD (dartR has a great (fast) ld estimator that works right on genind files, so let's use this)
ldreport <- dartR::gl.report.ld(dartR::gi2gl(genind_2.1, verbose = 0), name = NULL, save = FALSE, nchunks = 2, ncores = 3, chunkname = NULL, verbose = 0)
##   Processing a SNP dataset
## Start conversion....
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using >save(object, file="object.rdata")

We’ll prune (keep one) the dataset of any locus-pairs with r2 > 0.2, then convert to STRUCTURE format

unlinked_genind <- genind_2.1[loc=-unique(ldreport[ldreport$R2>0.2,]$loc2)]
rm(ldreport)
#note just sort of crashed through this with a text editor, not easily logged, but the general idea was transpose the data, split columns (diploid to dual haploid) then convert data to integers
df <- genind2df(unlinked_genind)
df <- as.data.frame(t(df))
write_tsv(df, "genotype_data_2021/all.str.tmp")
df <- read_tsv("genotype_data_2021/all.str.tmp", col_names = FALSE)
df <- t(df)
write_tsv(as.data.frame(df), "genotype_data_2021/all.str", col_names = FALSE)

Note that the final structure data file did not differ between when this run the first time (without removing the 2 redundant and missingness loci). So structure did not need to be run a second time. This removed 23 highly linked SNPs from the dataset. (25 the first time, but two of these were filtered out at an earlier stage in this run)

Run Log

Structure was run in a GUI outside this computation notebook’s environment.
admixture model: admixture, with correlated allele frequency
burnin/mcmc: ran with k=1-6 for 100k iteration to check for convergence of alpha, strong convergence after a few hundred burn-in iterations, used 10,000/20,000 for final runs
replicates: did 10 replicates for k=1-6

Best K was not chosen, but best k according to the evanno method was estimated in structure harvester for inclusing in the manuscript in case anyone wanted to see it.
Replicate results within each K were clumpp’d using the clumpak algorithm on the clumpak webserver

Results

Here we visualize the structure results of the clumpp’d results of all K values

Best K

Best K was 2 according to the Evanno (delta K) method, however, it’s important to remember the bias toward k=2 when differentiation is low or there is no population structure using this method. Delta k literally cannot evaluate K=1. (see note below)

note: delta K suggests best k is two, however, particularly at low differentiation, the delta k method is biased towards k=2 (cullingham 2020), seems like a good place to remind myself that k is a model that doesn’t always fully catpure biological reality and comparing results across different levels of k can proide interesting insights, even when best k is unknown, particularly in the case of low differentiation. Also see gagnaire 20xx and latch 2006.

Structure Plots

Next let’s take the clumppd results and make some publication-ready figures. First plot is downsampled to 42 samples per “population”, second plot is full dataset.

As with other sampling procedures, the run presented in the final rendered notebook here may not be the run used in the publication (i.e. a different set of 42 samples may be presented here)

#import clump results into dataframes
# results are in files k*/majorcluster/clumppfiles/clumpindoutput
# took these files and captured the relevant data with a text editor (original input files are a mess with multiple field separators) and saved to new files
k2 <- read_tsv("./structure/halfpounder_2021/clumpak/formatted_results/k2.txt")
k3 <- read_tsv("./structure/halfpounder_2021/clumpak/formatted_results/k3.txt")
k4 <- read_tsv("./structure/halfpounder_2021/clumpak/formatted_results/k4.txt")
k5 <- read_tsv("./structure/halfpounder_2021/clumpak/formatted_results/k5.txt")


plot_data <- k2 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust2) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

a <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(2))

plot_data <- k3 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust3) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

b <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(3))

plot_data <- k4 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust4) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

c <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(4))

plot_data <- k5 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust5) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

d <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x", labeller = labeller( pop =  c("fall" ="Late-Summer",  "halfpounder"= "Half-Pounder", "summer" =  "Early-Summer", "winter" =  "Winter"))) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_text(size = 10, angle = 90)) +
  scale_fill_manual(values = viridisLite::viridis(5))



cowplot::plot_grid(a,b,c,d, ncol=1, rel_heights  = c(1,1,1,1.2))

plot_data <- k2 %>% 
  rownames_to_column(var="id") %>% 
 # sample the half_pounder and fall fish to smaller size for plot
  gather('cluster', 'prob', clust1:clust2) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()


a <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(2))

plot_data <- k3 %>% 
  rownames_to_column(var="id") %>% 
  gather('cluster', 'prob', clust1:clust3) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

b <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(3))

plot_data <- k4 %>% 
  rownames_to_column(var="id") %>% 
  gather('cluster', 'prob', clust1:clust4) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

c <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(4))

plot_data <- k5 %>% 
  rownames_to_column(var="id") %>% 
  gather('cluster', 'prob', clust1:clust5) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

d <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x", labeller = labeller( pop =  c("fall" ="Late-Summer",  "halfpounder"= "Half-Pounder", "summer" =  "Early-Summer", "winter" =  "Winter"))) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_text(angle = 90)) +
  scale_fill_manual(values = viridisLite::viridis(5))


cowplot::plot_grid(a,b,c,d, rel_heights = c(1,1,1,1.5) ,ncol=1)

STRUCTURE Results Summary

  1. Regardless of number of putative ancestral genetic clusters (k) modeled, there was a high degree of admixture within individuals, consistent with the observation of high gene flow among a priori assigned groupings.
  2. at K=3 and greater, there is a clear summer associated genetic cluster with highly variable amounts of introgression of this genetic cluster into fall and half pounder assigned fish, and to a lesser extent winter run fish
  3. Similar to the results from k means clustering (see dapc below), as k increases, there is decreasing shared cluster membership among winter and summer run fish, while half pounder and fall run fish continue to split evenly among clusters

PCA

Below we perform an unconstrained ordination of genetic data for both the neutral and the full datasets using PCA (implemented in ade4 and adegenet)

Neutral PCA

first we run pca only on the neutral dataset

neutral_genind <- genind_2.1[loc=neutral_loci_names]

#rename pops for mpublication quality figure
neutral_genind$pop <- as.factor(str_replace(neutral_genind$pop, "Summer", "Early-Summer"))
neutral_genind$pop <- as.factor(str_replace(neutral_genind$pop, "fall", "Late-Summer")) 
neutral_genind$pop <- as.factor(str_replace(neutral_genind$pop, "halfpounder", "Half-Pounder")) 

#reorder pop factor
neutral_genind$pop <- factor(neutral_genind$pop, levels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))

#replace missing data using mean allele frequency
neutral_genind_scale <- scaleGen(neutral_genind, NA.method="mean")

# run the PCA, dudi.pca uses an interactive prompt to choose pcs to retain, we retain all in order to run some analyses on eigenvalues
pca1 <- dudi.pca(neutral_genind_scale,cent=FALSE,scale=FALSE, scannf = FALSE, nf = 332)
barplot(pca1$eig[1:332],main="PCA eigenvalues")

s.class(pca1$li, pop(neutral_genind),xax=1,yax=2, col=transp(viridisLite::viridis(4),0.7), axesell=FALSE, cstar=0, cpoint=1, grid=FALSE)
add.scatter.eig(pca1$eig[1:10],nf=3,xax=1,yax=2, posi = "bottomright" )

#publication quality figure
# pca_vals <- pca1$li[,c(1:2)]
# pca_vals <- pca_vals %>%
#   rownames_to_column(var = "Sample") %>%
#   left_join(select(genos_2.3, Sample, run)) %>%
#   mutate(run = if_else(run == "fall", "Late-Summer", run)) %>%
#   mutate(run = if_else(run == "halfpounder", "Half-Pounder", run)) %>%
#   mutate(run = if_else(run == "Summer", "Early-Summer", run))
# 
# ggplot(data = pca_vals, aes(Axis1, Axis2, color = run))+geom_point(alpha = 0.5)+stat_ellipse()+theme_classic()+scale_color_viridis_d()+xlab("Principal Component 1")+ylab("Principal Component 2")
# barplot(pca1$eig[1:10],main="PCA eigenvalues")

Let’s also make an interactive 3d plot, since the third PC is not much less informative than the second

plot_ly(x=pca1$li$Axis1, y=pca1$li$Axis2, z=pca1$li$Axis3, type="scatter3d", mode="markers", color=neutral_genind$pop, alpha = 0.8)

I’m interactive!!!

As suggested by our Fst results, there is little differentiation at neutral markers. Also of note is a halfpounder outlier along the first pc. It should not come as a surprise that pca fails to differentiate among populations given the fst. PCA should not be able to differentiate among pops when fst < 1/(sqrt(sample size X number of markers)), which in our case is about 3%, substantially greater than the fst calculated for neutral markers (patterson 2006).

Also note that the primary axis (1 / X) captures variation within fall and half pounedrs that is largely absent from winter and summer runs. Once again, another hint of population structure within halfpounders/fall run fish, but this time suggestive of neutral variation that doesn’t exist within the fall/winter runs.

Full PCA

#replace missing data using mean allele frequency
genind_scale <- scaleGen(genind_2.1, NA.method="mean")

genind_2.3 <- genind_2.1

#rename pops for mpublication quality figure
genind_2.3$pop <- as.factor(str_replace(genind_2.3$pop, "Summer", "Early-Summer"))
genind_2.3$pop <- as.factor(str_replace(genind_2.3$pop, "fall", "Late-Summer")) 
genind_2.3$pop <- as.factor(str_replace(genind_2.3$pop, "halfpounder", "Half-Pounder")) 

#reorder pop factor
genind_2.3$pop <- factor(genind_2.3$pop, levels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))


# run the PCA, dudi.pca uses an interactive prompt to choose pcs to retain, we retain all in order to run some analyses on eigenvalues
pca1 <- dudi.pca(genind_scale,cent=FALSE,scale=FALSE, scannf = FALSE, nf = 353)
barplot(pca1$eig[1:353],main="PCA eigenvalues")

s.class(pca1$li, pop(genind_2.3),xax=1,yax=2, col=transp(viridisLite::viridis(4),0.7), axesell=FALSE, cstar=0, cpoint=1, grid=FALSE)
add.scatter.eig(pca1$eig[1:10],nf=3,xax=1,yax=2)

Let’s also make an interactive 3d plot, since the third PC is not much less informative than the second

plot_ly(x=pca1$li$Axis1, y=pca1$li$Axis2, z=pca1$li$Axis3, type="scatter3d", mode="markers", color=genind_2.3$pop, alpha = 0.8)

In the first three most important axes of genetic differentiation using the full dataset we observe complete overlap of fall and halfpounder fish, winter and summer fish are clearly separated, but the cloud of fall-halfpounder fish is inclusive of both of these groups. Three clusters of data are present in the data, (1) a winter-like cluster also including some fall run and half pounders, (2) a summer-like cluster also including some fall and half pounder fush and (3) a uniquely half-pounder fall run cluster positioned intermediate between the two.

Also, just like the neutral PCA, this captures some variation unique to fall/halfpounders. This time this variance is pushed back to the third PC, instead of the first.

Second also, remember that many closely linked run timing SNPs inflates the eigenvalue for the PC that captures variation at those SNPs, so the distribution of eigenvalues represents the data NOT the genome.

DAPC

Here we pull a little harder to find some population structure. We conduct a discriminant analysis on PCA transformed genetic variation (DAPC).

k means

First we will use k-means clustering to infer the number of genetic clusters in the dataset without applying a priori assumptions about the number of clusters or population assignments in the sample datset. Then we will check if these groups match well with assigned pops before proceeding to the DAPC.

Below are plots of bayesian information criterion across different numbers of clusters and cluster membership across a range of most-likely clusters:

#k means clustering, keep a lot (330 pcs) (kmeans won't overfit with two many pcs)
#note the number of clusters was chosen interactively, the code below executes the clustering using the best k

clusts <- find.clusters(genind_2.3, n.pca = 330, choose.n.clust = FALSE)

#plot BIC
bic <- as.data.frame(cbind(c(1:length(clusts$Kstat)), clusts$Kstat))
ggplot(bic)+geom_point(aes(x=V1, y=V2))+geom_line(aes(x=V1, y=V2))+theme_classic()+xlim(c(0, 50))+xlab("k")+ylab("BIC")
BIC for k-means clustering across different numbers of genetic clusters

BIC for k-means clustering across different numbers of genetic clusters

clusts <- find.clusters(genind_2.3, n.pca = 330, n.clust = 2)
kable(table(pop(genind_2.3), clusts$grp),caption = "a priori population vs genetic cluster" )
a priori population vs genetic cluster
1 2
Late-Summer 113 162
Half-Pounder 339 304
Early-Summer 0 42
Winter 40 0
clusts <- find.clusters(genind_2.3, n.pca = 330, n.clust = 3)
kable(table(pop(genind_2.3), clusts$grp),caption = "a priori population vs genetic cluster" ) 
a priori population vs genetic cluster
1 2 3
Late-Summer 83 72 120
Half-Pounder 218 256 169
Early-Summer 42 0 0
Winter 0 33 7
clusts <- find.clusters(genind_2.3, n.pca = 330, n.clust = 4)
kable(table(pop(genind_2.3), clusts$grp), caption = "a priori population vs genetic cluster" )
a priori population vs genetic cluster
1 2 3 4
Late-Summer 42 120 25 88
Half-Pounder 195 171 95 182
Early-Summer 0 0 35 7
Winter 33 7 0 0

The model fit is best at k=4. BIC tends to decrease until best fit only in a perfect island model, in practice, best K is often found at the lowest K that is a substantial improvement from the last. Here we use k=4, but when other k were used, generally the same result: Never any overlap between winter and summer run fish, fall and halfpounders always distributed among the 4 groups.

This result fits with our previous findings, there is structure and high diversity within fall and halfpounders (see heterozygosity section) and substantial overlap between fall-halfpounders and both winter and summer fish, but not between winter and summer fish.

This creates a complex scenario for conducting the DAPC, should we use a priori assigned populations as the basis of our DA? Simulation stuides (eg miller et al 2020) suggests that at low fst, the ability of k means clustering to succesfully recapitulate real genetic clusters is low, however, DA on biologically inaccurate grouping variables is not meaningful… Let’s do both for now as they are both informative in different ways.

DAPC de novo

#first optimize the PCs retained based on the a.score, so run dapc on the full pcs
#invisible(dapc_full_denovo <- dapc(genind_2.1, clusts$grp, n.pca = 330, n.da = 3 ))
#optim_a_denovo <- optim.a.score(dapc_full_denovo, n = 20)
#nice now we use the optimized a score to run our dapc for real
dapc_full_denovo <- dapc(genind_2.3, clusts$grp, n.pca = 25, n.da = 3 )

plot_data <- as.data.frame(cbind(dapc_full_denovo$ind.coord, genind_2.3$pop, clusts$grp))
colnames(plot_data) <- c("LD1", "LD2", "LD3", "pop", "grp")
plot_data$pop <- as.factor(plot_data$pop)
plot_data$grp <- as.factor(plot_data$grp)

ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=pop))+scale_color_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme_classic()+ggtitle("DAPC by k-means cluster, color by a priori population")+guides(color=guide_legend("Run-Type")) 

ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=grp))+scale_color_viridis_d()+theme_classic()+ggtitle("DAPC by cluster, color by k-means cluster")+guides(color=guide_legend("K-means cluster")) 

marker_loadings1 <- loadingplot(dapc_full_denovo$var.contr, axis=1,thres=.005, lab.jitter=1, main = "DA axis 1 loading plot")

markers1 <- unique(substr(names(marker_loadings1$var.values),1,nchar(names(marker_loadings1$var.values))-2))

marker_loadings2 <- loadingplot(dapc_full_denovo$var.contr, axis=2,thres=.02, lab.jitter=1, main = "DA axis 2 loading plot")

markers2 <- unique(substr(names(marker_loadings2$var.values),1,nchar(names(marker_loadings2$var.values))-2))

kable(marker_mapping2 %>%
  filter(marker %in% markers1 ) %>%
  select(marker, `Presumed Type`), caption = "markers heavily loaded into first discriminant axis")
markers heavily loaded into first discriminant axis
marker Presumed Type
Chr28_11667578 Adaptive. Run Timing
OmyR14589 Adaptive. Residency vs anadromy
Omy_bcAKala-380rd Neutral
OmyR24370 Adaptive. Residency vs anadromy
OmyR40252 Adaptive. Residency vs anadromy
OmyR19198 Adaptive. Residency vs anadromy
OmyR33562 Adaptive. Residency vs anadromy
Omy_SECC22b-88 Neutral. Possible linkage
Omy_GREB1_05 Adaptive. Run timing
Chr28_11625241 Adaptive. Run Timing
Chr28_11658853 Adaptive. Run Timing
Omy_RAD15709-53 Adaptive. Natal site Isothermality. Basin-wide, run-time - related
Chr28_11676622 Adaptive. Run Timing
Chr28_11683204 Adaptive. Run Timing
Chr28_11702210 Adaptive. Run Timing
kable(marker_mapping2 %>%
  filter(marker %in% markers2 ) %>%
  select(marker, `Presumed Type`), caption = "markers heavily loaded into second discriminant axis")
markers heavily loaded into second discriminant axis
marker Presumed Type
OmyR14589 Adaptive. Residency vs anadromy
Omy_bcAKala-380rd Neutral
OmyR24370 Adaptive. Residency vs anadromy
OmyR40252 Adaptive. Residency vs anadromy
OmyR19198 Adaptive. Residency vs anadromy
OmyR33562 Adaptive. Residency vs anadromy
# pub figure
# a <- ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=pop), alpha = 0.8)+scale_color_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme_classic()+ggtitle("DAPC by k-means cluster, color by a priori population")+guides(color=guide_legend("Run-Type")) 
# 
# b <- ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=grp), alpha = 0.8)+scale_color_viridis_d()+theme_classic()+ggtitle("DAPC by cluster, color by k-means cluster")+guides(color=guide_legend("K-means cluster")) 
# 
# plot_grid(a, b, ncol = 1, labels = "auto")

As suspected, given the the fact that fall and halfpounder fish do not fall into a single genetic cluster, de novo assigned genetic cluters do not do a great job of differentiating among a priori assigned run-timing groups. When breaking the fish into four groups these groups are mostly defined by covariation at ~15 markers including run timing markers, 2 neutral markers (linked to residency markers), and residency markers. Note that the loadings on the first axis for the run-timing markers are ~2 fold greater than other markers included in the table above and that this cutoff is arbitrary.

DAPC a priori

A note on the best number of pcs
Given that (a) we are using a priori assigned groups to assess differences among said groups and (b) the number of predictors is only somewhat less than the number of objects (n = 1000 and p= ~350), we have to be really cautious about overfitting here, so we’re going to run both cross-fold validation and the “a.score” procedure to determine the correct number of pcs to retain, then we will retain pcs according to which optimization strategy suggests the lower number of pcs. One issue here is that previous results suggest that at least one and potentially two (fall and half pounders) of our a priori groups is a synthetic (i.e. non exclusively genetic) assemblage of individuals. If this is the case, overfitting might cause some real issues and results might not the underlying biology. Therefore our choice of n.pc should attempt to limit overfitting by choosing the minimum PCs.

Final results:
The best pca according to optim.a.score was 103. After pc 7 however, there was only a gradual improvement with each aditional PC, two local maxima at pc 15 and pc 45 are also present. Best pca according to cross validation using overall group assignment success was 7, however cross validation using mean group assignment per group was 187, although there are local maxima at 7, and 16 pcs. Similar to the a.score approach, there is not a strong “arc” pattern in the latter as one might suspect for a cross validation procedure, instead fit improves rapidly after the first few pcs, declines, then and continues a gradual improvement. WE considered this to be a potential example of double descent, as once the number of PCs exceeds the group sample size, the model is overparametized (p > n). We chose the number of PCs that correspond to best value from the xval procedure using overall asignment, because of our concern with overfitting. This value also represents the last value that is a substantial increase from the previous in both other procedures. However, it should be noted that running DAPC with widely varying PCs did not qualitatively change the pattern of results. The degree of discrimination increased with increasing PCs and loadings on individual markers varied, but not the qualitative relationships among a priori assigned populations.

Now the DAPC

#first optimize the PCs retained based on the a.score, so run dapc on the full pcs
#invisible(dapc_full_apriori <- dapc(genind_2.3, n.pca = 330, n.da = 3 ))
#to speed this up in future knits, we ran optim.a.score once, but not in the notebook
#optim_a_apriori <- optim.a.score(dapc_full_apriori, smart=FALSE, n.sim=50)
#nice now we use the optimized a score to run our dapc for real
#mat <- as.matrix(scaleGen(genind_2.3, NA.method="mean", scale=FALSE, center=FALSE))
#xpop <- pop(genind_2.3)
#xval <- xvalDapc(mat, xpop, n.pca.max = 200, training.set = 0.9, result = "overall", center = TRUE, scale = FALSE, n.pca = seq(1,300, length.out = 60), n.rep = 50, xval.plot = TRUE)

#xval <- xvalDapc(mat, xpop, n.pca.max = 200, training.set = 0.9, result = "overall", center = TRUE, scale = FALSE, n.pca = seq(1,200), n.rep = 25, xval.plot = TRUE)

#xval_group <- xvalDapc(mat, xpop, n.pca.max = 200, training.set = 0.9, result = "groupMean", center = TRUE, scale = FALSE, n.pca = seq(1,200), n.rep = 25, xval.plot = TRUE)

invisible(dapc_full_apriori <- dapc(genind_2.3, n.pca = 7, n.da = 3 ))


plot_data <- as.data.frame(dapc_full_apriori$ind.coord)
plot_data$pop <- as.character(genind_2.3$pop)


ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d()+stat_ellipse(aes(x=LD1, y=LD2, color = pop))

scatter(dapc_full_apriori)

marker_loadings1 <- loadingplot(dapc_full_apriori$var.contr, axis=1,thres=.01, lab.jitter=1, main = "loading plot for DA 1")

markers1 <- unique(substr(names(marker_loadings1$var.values),1,nchar(names(marker_loadings1$var.values))-2))

marker_loadings2 <- loadingplot(dapc_full_apriori$var.contr, axis=2,thres=.006, lab.jitter=1, main = "loading plot for DA 2")

markers2 <- unique(substr(names(marker_loadings2$var.values),1,nchar(names(marker_loadings2$var.values))-2))

kable(marker_mapping2 %>%
  filter(marker %in% markers1 ) %>%
  select(marker, `Presumed Type`), caption = "Markers that load strongly onto first DA")
Markers that load strongly onto first DA
marker Presumed Type
Chr28_11667578 Adaptive. Run Timing
Omy_128996-481 Neutral
Omy_GREB1_05 Adaptive. Run timing
Chr28_11625241 Adaptive. Run Timing
Chr28_11658853 Adaptive. Run Timing
Omy_RAD15709-53 Adaptive. Natal site Isothermality. Basin-wide, run-time - related
Chr28_11676622 Adaptive. Run Timing
Chr28_11683204 Adaptive. Run Timing
Chr28_11702210 Adaptive. Run Timing
kable(marker_mapping2 %>%
  filter(marker %in% markers2 ) %>%
  select(marker, `Presumed Type`, chr, start), caption = "Markers that load strongly onto the second DA")
Markers that load strongly onto the second DA
marker Presumed Type chr start
Omy_112301-202 Neutral 3 37590514
Omy_103705-558 Neutral 17 7065921
OmyR14589 Adaptive. Residency vs anadromy 5 56162739
Omy_RAD42465-32 Adaptive. Maxiumum air temperature (warmest quarter). Basin-wide, top-outlier 18 25034263
Omy_bcAKala-380rd Neutral 5 53469258
Omy_G3PD_2-371 Neutral 2 6083880
OmyR24370 Adaptive. Residency vs anadromy 5 28579317
OmyR40252 Adaptive. Residency vs anadromy 5 31675237
OmyR19198 Adaptive. Residency vs anadromy 5 34973454
OmyR33562 Adaptive. Residency vs anadromy 5 47337494
Omy_SECC22b-88 Neutral. Possible linkage 5 61828845
Omy_101993-189 Neutral 17 21491237
Omy_RAD43612-42 Neutral 18 29118747
Omy_LDHB-2_e5 Neutral 21 24129878
Omy_LDHB-2_i6 Neutral 21 24130489
Chr28_11683204 Adaptive. Run Timing 28 11683151
Chr28_11702210 Adaptive. Run Timing 28 11702168
Omy_Il1b-198 Neutral 19 10329643
# #for ms figures
# plot_data %<>%
#  mutate(pop = as.factor(plot_data$pop)) %>%
#  mutate(pop = fct_relevel(pop, "Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))
# plot_data$pop <- as.factor(plot_data$pop)
# ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d(name = "Run",labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter" ))+stat_ellipse(aes(x=LD1, y=LD2, color = pop))

#for ms tables with loading value
#marker_loadings2a <- loadingplot(dapc_full_apriori$var.contr, axis=2,thres=.00000001, lab.jitter=1, main = "laoding plot for DA 2")
#write.table(as.data.frame(marker_loadings2a$var.values[seq(1, length(marker_loadings2a$var.values), 2)]), "load.txt", quote = FALSE, sep = "\t" )
#write.table(select(marker_mapping2, marker, `Presumed Type`, chr), "map.txt", quote=FALSE, sep = "\t")  

loads <- marker_loadings2$var.values*2
loads <- as.data.frame(loads)
loads <- loads %>%
 rownames_to_column(var = "marker") %>% 
 mutate(marker = str_sub(marker, 1, nchar(marker)-2)) %>%
  left_join(marker_mapping2) %>%
  select(marker, loads, `Presumed Type`) %>%
  filter(row_number() %% 2 == 0) %>%
  arrange(desc(loads))

Let’s also check where all the migration timing markers in the dataset fall in the distribution of variable loadings, as an ad hoc association test.

all_loadings <- loadingplot(dapc_full_apriori$var.contr, axis=1,thres=.00000001)

all_loads <- all_loadings$var.values*2
all_loads <- as.data.frame(all_loads)
all_loads <- all_loads %>%
 rownames_to_column(var = "marker") %>% 
 mutate(marker = str_sub(marker, 1, nchar(marker)-2)) %>%
  left_join(marker_mapping2) %>%
  select(marker, all_loads, `Presumed Type`) %>%
  filter(row_number() %% 2 == 0) %>%
  arrange(desc(all_loads))

all_loads %>%
  mutate(percentile = ntile(all_loads, 100)) %>%
  filter(marker %in% run_timing_loci_names) %>%
  select(marker, percentile)

Summary
DAPC largely discriminates among groups along the first discriminant axis. Fall and half pounder fish demonstrate little to no differentiation along any DA. The first DA strongly separates winter and summer run fish, and is driven by by 9 markers including 8 run timing associated markers and a neutral marker (Omy_128996-481). The second da captures much less variation among groups (78% vs 17%), and is driven largely by 5 residency markers and 13 additional markers including neutral markers and several associated with run timing and other adapative traits. Interstingly one of the “neutral” annotated markers (Omy_LDHB-2…) have been associated with residency vs anadromy in an association study in the klickitat river (doi.org/10.1080/00028487.2011.588131).

DAPC Classification

Next we use this a priori DAPC to attempt to classify half pounders into early-summer-like, winter-like or intermediate groups. Note: did this two ways, in the first we run a true classfiication scheme: exclude the fall and half pounder fish, run create a classifier then predict the fall and half pounder membership based on this classifier, however if fall run fish are a real group, the classifier should include all three or four groups and therefore would default back to the original DAPC. I left the true classifier (just winter vs summer) in the notebook, but I think we should just present the full a priori DAPC results. In any case, the effect on the end result is marginal

classifier <- dapc(genind_2.1[pop = c("Winter", "Summer")], n.da = 2, n.pca = 1) #optim a score is 1
pred.half<- predict.dapc(classifier, newdata=genind_2.1[pop=c("halfpounder" , "fall")])

preds <- as.data.frame(cbind(pred.half$posterior, pred.half$ind.scores))

#hmm pop info is not moving over easilt left just merge it from another df
colnames(fstat)[1] <- "pop"
pops_info <- fstat %>%
  rownames_to_column(var="sample") %>%
  select(sample, pop)
preds <- preds %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)

ld1 <- as.data.frame(classifier$ind.coord) %>%
   bind_rows(as.data.frame(pred.half$ind.scores)) %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)

ld1_2 <- as.data.frame(dapc_full_apriori$ind.coord) %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)
    
#plot
a <- ggplot(data=ld1)+geom_density(aes(LD1, fill=pop, color=pop), alpha = 0.2)+scale_color_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+scale_fill_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme_classic()+ggtitle("True Classifier")+labs(color = "Run", fill = "Run")

b <- ggplot(data=ld1_2)+geom_density(aes(LD1, fill=pop, color=pop), alpha = 0.2)+scale_color_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+scale_fill_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme_classic()+ggtitle("A priori DAPC results")+labs(color = "Run", fill = "Run")

plot_grid(b,a, labels = "auto", ncol = 1)

# assignments using classifier
CIs <- ld1 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1 %>%
  filter(pop == "halfpounder") %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum(( LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "True Classifier - Half Pounders")
True Classifier - Half Pounders
winter_assigned summer_assigned unassigned
158 74 411
# assignments using original DAPC
CIs <- ld1_2 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1_2 %>%
  filter(pop == "halfpounder") %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum(( LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "A priori DAPC - Half Pounders")
A priori DAPC - Half Pounders
winter_assigned summer_assigned unassigned
322 65 256
#same as above but for fall fish

CIs <- ld1 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1 %>%
  filter(pop == "fall") %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum((LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "True Classifier - Fall Run")
True Classifier - Fall Run
winter_assigned summer_assigned unassigned
31 12 232
# assignments using original DAPC
CIs <- ld1_2 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1_2 %>%
  filter(pop == "fall") %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum((LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "A priori DAPC - Fall Run")
A priori DAPC - Fall Run
winter_assigned summer_assigned unassigned
127 8 140
#split by year
ld1_2 %<>%
  left_join(select(genos_2.3, Sample, year), by = c("sample" = "Sample"))

kable(ld1_2 %>%
  filter(pop == "fall") %>%
    group_by(year) %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum((LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "A priori DAPC - Fall Run")
A priori DAPC - Fall Run
year winter_assigned summer_assigned unassigned
2019 70 6 81
2020 57 2 59
kable(ld1_2 %>%
  filter(pop == "halfpounder") %>%
    group_by(year) %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum((LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "A priori DAPC - Halfpounder")
A priori DAPC - Halfpounder
year winter_assigned summer_assigned unassigned
2018 182 25 131
2019 140 40 125
ld1 %<>%
  left_join(select(genos_2.3, Sample, year), by = c("sample" = "Sample"))

kable(ld1 %>%
  filter(pop == "halfpounder") %>%
    group_by(year) %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum(( LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "True Classifier - Half Pounders")
True Classifier - Half Pounders
year winter_assigned summer_assigned unassigned
2018 233 53 52
2019 177 71 57
kable(ld1 %>%
  filter(pop == "fall") %>%
    group_by(year) %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum((LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "True Classifier - Fall Run")
True Classifier - Fall Run
year winter_assigned summer_assigned unassigned
2019 90 21 46
2020 75 12 31

Let’s look at the loadings of alleles that create this discriminant axis.

class_loadings <- loadingplot(classifier$var.contr, axis=1,thres=-1)

class_loads <- class_loadings$var.values*2
class_loads <- as.data.frame(class_loads)
class_loads <- class_loads %>%
 rownames_to_column(var = "marker") %>% 
 mutate(marker = str_sub(marker, 1, nchar(marker)-2)) %>%
  left_join(marker_mapping2) %>%
  select(marker, class_loads, `Presumed Type`) %>%
  filter(row_number() %% 2 == 0) %>%
  arrange(desc(class_loads))

class_loads %>%
  mutate(percentile = ntile(class_loads, 100)) %>%
  filter(marker %in% run_timing_loci_names) %>%
  select(marker, percentile)

When we use the original DAPC (not the classifier) we attempt to classify fish into winter or summer like using the first DA which accounts for ~80% of the among group variance and is driven almost exclusively by known run timing associated markers. We then draw 95% credible intervals for winter and summer run and count how many fall and half pounder fish fall beyond the lower bounds for these intervals (i.e. at least as winter-like or summer-like as the 95% CI). For fall fish, the majority are unassigned, nearly as many are winter assigned and a small portion are summer assignmed. The same pattern is observed among the half pounders, but ratio number of unassigned fish is somewhat lower.

Neutral DAPC

There is evidence of weak differentiation among populations at neutral markers from the estimation of Fst. Here we will attempt to magnify these differences using a discirminat analaysis among a priori assigned run types.

#neutraldapc <- dapc(neutral_genind, n.pca = 226, n.da= 3 )
#optim.a.score(neutraldapc, smart = FALSE)
#nice now we use the optimized a score to run our dapc for real
invisible(neutraldapc <- dapc(neutral_genind, n.pca = 97, n.da = 3 ))


plot_data <- as.data.frame(neutraldapc$ind.coord)
plot_data$pop <- as.character(genind_2.1$pop)

#scatter.dapc(neutraldapc)

plot_ly(x=plot_data$LD1, y=plot_data$LD2, z=plot_data$LD3, type="scatter3d", mode="markers", color=genind_2.1$pop, alpha = 0.8)
#pub quality figure
ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d(name = "Run",labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter" ))+stat_ellipse(aes(x=LD1, y=LD2, color = pop))

DA axis 1 separates summer from half pounder, fall and winter. DA 2 separates summer from the rest. DA3 also separates summer from the rest. Interestingly this is the same pattern as obsevred using the full dataset, despite the fact that the first DA was driven almost exclusively by run timing markers. This suggests either LD in our dataset between run timing and neutral markers, or that restricted gene flow tied to run timing leads differentiation at neutral markers. In any case it follows with the neutral Fst a that half pounders and fall run fish are more similar to winter run than summer run across the genome.

DAPC Results Summary

K means clustering always assigned half pounders and fall run fish to all genetic clusters, while always segregating winter and summer run fish. This suggests there is substantial structure with fall and half pounder fish. Similarly, when we attempt to discriminate among a priori assigned groups, the genetic variation contained within adults that make a fall run, as well as half-pounders is nearly entirely inclusive of winter run fish, and partially inclusive of summer run fish, however, winter and summer run fish are well discriminated.

In other words:
(a) winter and summer run fish can clearly be distinguished from one another using the genetic variation captured by our GTseq panel. Genetic variation among these two groups is driven by markers with known run-timing associations
(b) fall run and half pounder fish can not be discriminated from each other using the genetic information we have available
(c) fall run and half pounder fish demonstrate substantial within-“population” structure, and include fish that demonstrate a high degree of genetic similarity with both winter run and summer run fish, hwoever, most individuals demonstrate intermediate genetic charactistics. interestingly,this within population structure is driven neutral markers and residency associated markers

Run Timing Markers

Let’s look specifically at run-timing associated markers

marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)
run_timing_loci_names <- marker_mapping %>%
  filter(chr == "28" | CRITFC_chromosome == "28") %>%
  filter(str_detect(`Presumed Type`, 'run|Run')) %>%
  select(marker)

#different naming convention, lets fix
run_timing_loci_names <- str_replace(run_timing_loci_names$marker, "Omy28", "Chr28")

#run_timing_loci <- genos_2.3 %>%
#  select(Sample, Date, run, year, one_of(run_timing_loci_names))

#genind_pop <- seppop(genind_2.1)
#run_timing_fall <- genind_pop$fall[loc=run_timing_loci_names]
#run_timing_half <- genind_pop$halfpounder[loc=run_timing_loci_names]
#run_timing_winter <- genind_pop$Winter[loc=run_timing_loci_names]
#run_timing_summer <- genind_pop$Summer[loc=run_timing_loci_names]

Diagnostic Markers

Are any markers diagnostic (fixed or nearly fixed within a run-timing group)? Below we plot:
(a) a full heatmap of allele frequencies
(b) heatmap just at near diagnostic markers (0.9 vs 0.1 across winter and summer runs)
(c) a heatmap of all markers with run-timing associations

#because adegenet can be so difficult to use, let take advantage of popgenreport to collect our summary data
all_counts <- allele.dist(genind_2.1, mk.figures = FALSE)$count
#make into a dataframe
all_counts <- as.data.frame(do.call(rbind, all_counts))
colnames(all_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
all_counts$sum <- rowSums(all_counts, na.rm = TRUE)

all_freqs <- allele.dist(genind_2.1, mk.figures = FALSE)$frequency
#make into a dataframe
all_freqs <- as.data.frame(do.call(rbind, all_freqs))

all_freqs <- as.data.frame(cbind(all_freqs, all_counts))

##### get only minor allele
all_freqs$marker <- genind_2.1$loc.fac

#now group by marker and keep the minor allele, then convert counts to 
all_maf <- all_freqs %>%
  group_by(marker) %>%
  slice_min(sum) %>%
  replace(., is.na(.), 0)

#filter all maf to include only diagnostic or near diagnostic (0.95) markers between winter and summer runs
diagmaf <- all_maf %>%
  filter((Summer > 0.9 & Winter < 0.1) | (Summer < 0.1 & Winter > 0.9))

# plot big heatmap
tmat <- t(as.matrix(all_maf[,1:4]))
colnames(tmat) <- all_maf$marker
pheatmap(tmat, show_colnames  = F)

#plot as a heatmap
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(diagmaf[,1:4]))
colnames(tmat) <- diagmaf$marker
pheatmap(tmat, cluster_cols = FALSE)

#because adegenet can be so difficult to use, let take advantage of popgenreport to collect our summary data
run_timing_counts <- allele.dist(genind_2.1[loc=run_timing_loci_names], mk.figures = FALSE)$count
#make into a dataframe
run_timing_counts <- as.data.frame(do.call(rbind, run_timing_counts))
colnames(run_timing_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
run_timing_counts$sum <- rowSums(run_timing_counts, na.rm = TRUE)

run_timing_freqs <- allele.dist(genind_2.1[loc=run_timing_loci_names], mk.figures = FALSE)$frequency
#make into a dataframe
run_timing_freqs <- as.data.frame(do.call(rbind, run_timing_freqs))

run_timing_freqs <- as.data.frame(cbind(run_timing_freqs, run_timing_counts))

##### get only minor allele
#then put the marker names in the same order as the allele.dist (exlude the missing marker and duplicate marker first)
run_timing_loci_names <- run_timing_loci_names[run_timing_loci_names %in% colnames(genos_2.3)]
run_timing_freqs$marker <- rep(run_timing_loci_names[order(run_timing_loci_names)], each = 2)
#now group by marker and keep the minor allele, then convert counts to 
run_timing_maf <- run_timing_freqs %>%
  group_by(marker) %>%
  slice_min(sum) %>%
  replace(., is.na(.), 0)

#plot as a heatmap
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(run_timing_maf[,1:4]))
colnames(tmat) <- run_timing_maf$marker
pheatmap(tmat, cluster_cols = FALSE)

#repolarize greb05 and 7080-54
repol <- run_timing_maf %>%
  mutate(fall = ifelse(marker == "Omy_GREB1_05", (1 - fall), fall)) %>%
  mutate(halfpounder = ifelse(marker == "Omy_GREB1_05", (1 - halfpounder), halfpounder)) %>%
  mutate(Summer = ifelse(marker == "Omy_GREB1_05", (1 - Summer), Summer)) %>%
  mutate(Winter = ifelse(marker == "Omy_GREB1_05", (1 - Winter), Winter)) #%>%
#  mutate(fall = ifelse(marker == "Omy_RAD47080-54", (1 - fall), fall)) %>%
#  mutate(halfpounder = ifelse(marker == "Omy_RAD47080-54", (1 - halfpounder), halfpounder)) %>%
#  mutate(Summer = ifelse(marker == "Omy_RAD47080-54", (1 - Summer), Summer)) %>%
#  mutate(Winter = ifelse(marker == "Omy_RAD47080-54", (1 - Winter), Winter))

repol <- repol %>%
  select(fall, halfpounder, Winter, Summer, marker) %>%
  rename("Late-Summer" = fall, "Half-Pounder" = halfpounder, "Early-Summer" = Summer)

col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(repol[,1:4]))
colnames(tmat) <- repol$marker
pheatmap(tmat)

Of the 12 markers with known run-timing associations, 7 are fixed for alternative alleles in winter and summer run fish. One additional marker (greb1_05) is highly informative, but is not fixed in winter run fish. Fall run fish and half pounder demonstrate intermediate genotypes and very little differentiation from one another.

More heatmaps

Sometimes it can be helpful to look at the pattern across individuals and along the genome rather than allele frequency. Below we make a graphical representation of some of the genotypes at run timing associated markers as well.

#first we need to get a dataset that will work

#the tab slot of the adegenet object will work if we polarize the data by selecting the allele with the higher count in either winter or early summer

#for each pair of columns, keep the one with the highest average, few enough markers, we can just choose these manually
colSums(chr28_seppop$Winter$tab, na.rm = TRUE)
##  Chr28_11607954.G  Chr28_11607954.A  Chr28_11625241.A  Chr28_11625241.G 
##                62                18                 0                80 
##  Chr28_11632591.G  Chr28_11632591.A  Chr28_11658853.A  Chr28_11658853.C 
##                63                17                 0                80 
##  Chr28_11667578.T  Chr28_11667578.C  Chr28_11676622.T  Chr28_11676622.G 
##                 0                80                 0                80 
##  Chr28_11683204.G  Chr28_11683204.T  Chr28_11702210.T  Chr28_11702210.G 
##                 0                80                 0                80 
##  Chr28_11773194.A  Chr28_11773194.T    Omy_GREB1_05.T    Omy_GREB1_05.G 
##                72                 8                14                66 
##    Omy_GREB1_09.G    Omy_GREB1_09.T Omy_RAD15709-53.G Omy_RAD15709-53.A 
##                80                 0                80                 0
#bind these together
polarized_allele_counts <- as.data.frame(chr28_seppop$Winter$tab[,c(1,4,5,8,10,12,14,16,17,20,21,23)]) %>%
  bind_rows(as.data.frame(chr28_seppop$Summer$tab[,c(1,4,5,8,10,12,14,16,17,20,21,23)])) %>%
  bind_rows(as.data.frame(chr28_seppop$fall$tab[,c(1,4,5,8,10,12,14,16,17,20,21,23)])) %>%
  bind_rows(as.data.frame(chr28_seppop$halfpounder$tab[,c(1,4,5,8,10,12,14,16,17,20,21,23)]))


#plot heatmap

#first, order the markers according to genomic position
map_results <- readxl::read_xlsx("metadata/final_mapping_results.xlsx", sheet = 1)
map_results <- map_results %>%
  mutate(marker = str_replace(marker, "Omy28", "Chr28"))

colnames(polarized_allele_counts) <- substr(colnames(polarized_allele_counts), 0, nchar(colnames(polarized_allele_counts))-2)

marker_pos <- map_results %>%
  select(marker, CRITFC_SNP_pos_genome) %>%
  filter(marker %in% colnames(polarized_allele_counts))

#hmm one marker position is missing, just add it manually
marker_pos$CRITFC_SNP_pos_genome <- as.numeric(marker_pos$CRITFC_SNP_pos_genome)
## Warning: NAs introduced by coercion
marker_pos[12,2] <- 11702210

#order to columns
polarized_allele_counts <- polarized_allele_counts %>%
  select(unlist(c(marker_pos[order(marker_pos$CRITFC_SNP_pos_genome),1]), use.names = FALSE))
#add pop info
polarized_allele_counts$pop <- c(rep("winter", nrow(chr28_seppop$Winter$tab)), rep("summer", nrow(chr28_seppop$Summer$tab)), rep("fall", nrow(chr28_seppop$fall$tab)), rep("halfpounder", nrow(chr28_seppop$halfpounder$tab)))

#split into groups for easy visualization
winter_pac <- filter(polarized_allele_counts, pop == "winter")
summer_pac <- filter(polarized_allele_counts, pop == "summer")
fall_pac <- filter(polarized_allele_counts, pop == "fall")
halfpounder_pac <- filter(polarized_allele_counts, pop == "halfpounder")

a = pheatmap(winter_pac[,-13], cluster_cols = FALSE, show_rownames = FALSE, main = "winter")

b = pheatmap(summer_pac[,-13], cluster_cols = FALSE, show_rownames = FALSE, main = "summer")

c = pheatmap(fall_pac[,-13], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")

d = pheatmap(halfpounder_pac[,-13], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")

#same as above but for diag only
fall_pac_diag <- select(fall_pac, -c("Chr28_11607954", "Omy_GREB1_05", "Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194"))


#repolarize the -54 marker
#fall_pac_diag <- fall_pac_diag %>%
#  mutate(`Omy_RAD47080-54` = 2 - `Omy_RAD47080-54`)

c = pheatmap(fall_pac_diag[,-8], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")

half_pac_diag <- select(halfpounder_pac, -c("Chr28_11607954", "Omy_GREB1_05", "Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194"))


#repolarize the -54 marker
#fall_pac_diag <- fall_pac_diag %>%
#  mutate(`Omy_RAD47080-54` = 2 - `Omy_RAD47080-54`)

c = pheatmap(half_pac_diag[,-8], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")

# same as above but excluding markers with weird missingness patterns and getting rid of the duplicate (-54)

#fall_pac_diag_mo_miss <- select(fall_pac_diag, -c("Omy_RAD47080-54", "Chr28_11671116"))

#fall_pac_diag_2019 <- fall_pac_diag %>%
#  rownames_to_column(var = "sample") %>%
#  filter(str_detect(sample, "OmyAC19"))

#fall_pac_diag_2020 <- fall_pac_diag %>%
#  rownames_to_column(var = "sample") %>%
#  filter(str_detect(sample, "OmyAC20"))

#counting all winter fall run
#sum(rowSums(fall_pac_diag_mo_miss[,-8], na.rm = TRUE) ==14)
#counting all summer fall run
#sum(rowSums(fall_pac_diag_mo_miss[,-8], na.rm = TRUE) == 0)

#same for half
#half_pac_diag_mo_miss <- select(half_pac_diag, -c("Omy_RAD47080-54", "Chr28_11671116"))

#counting all winter fall run
#sum(rowSums(half_pac_diag_mo_miss[,-8], na.rm = TRUE) ==14)
#counting all summer fall run
#sum(rowSums(half_pac_diag_mo_miss[,-8], na.rm = TRUE) == 0)

#again but split by year
#half_pac_diag_mo_miss_2018 <- half_pac_diag_mo_miss %>%
#  rownames_to_column(var = "Sample") %>%
#  left_join(select(genos_2.3, Sample, year)) %>%
#  filter(year == "2018")

#half_pac_diag_mo_miss_2019 <- half_pac_diag_mo_miss %>%
#  rownames_to_column(var = "Sample") %>%
#  left_join(select(genos_2.3, Sample, year)) %>%
#  filter(year == "2019")

#here we count the number of samples that uncontroversially fall into summer and winter bins (no missing data, only winter/sumer alleles)
#sum(rowSums(half_pac_diag_mo_miss_2018[,-c(1,9,10)], na.rm = TRUE) ==14)
#sum(rowSums(half_pac_diag_mo_miss_2018[,-c(1,9,10)], na.rm = TRUE) ==0)


#sum(rowSums(half_pac_diag_mo_miss_2019[,-c(1,9,10)], na.rm = TRUE) ==14)
#sum(rowSums(half_pac_diag_mo_miss_2019[,-c(1,9,10)], na.rm = TRUE) ==0)

Definitely some interesting patterns here. It might be more useful to explore this with inferred haplotypes, so we’ll hold off for now.

Omy05 Markers

The second DA in the a priori DAPC captures less among group variation than the first, however, it is oriented along a large axis of genetic variation within both the fall run and halfpounders. Part of this axis is driven by snps with known association with residency vs anadromy. Let’s explore this briefly:

First let’s take a look at the allele frequencies at residency vs anadromy alleles on chromosome 5

residency_loci_names <- marker_mapping %>%
  filter(chr == "5" | CRITFC_chromosome == "5") %>%
  filter(str_detect(`Presumed Type`, 'residency|Residency')) %>%
  select(marker)

residency_loci_names <- residency_loci_names$marker

#one of these is not in e final dataset, remove
residency_loci_names <- residency_loci_names[-6]

# get summary info
residency_counts <- allele.dist(genind_2.1[loc=residency_loci_names], mk.figures = FALSE)$count
#make into a dataframe
residency_counts <- as.data.frame(do.call(rbind, residency_counts))
colnames(residency_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
residency_counts$sum <- rowSums(residency_counts, na.rm = TRUE)

residency_freqs <- allele.dist(genind_2.1[loc=residency_loci_names], mk.figures = FALSE)$frequency
#make into a dataframe
residency_freqs <- as.data.frame(do.call(rbind, residency_freqs))

residency_freqs <- as.data.frame(cbind(residency_freqs, residency_counts))

##### get only minor allele
#then put the marker names in the same order as the allele.dist (exlude the missing marker first)
residency_freqs$marker <- rep(residency_loci_names[order(residency_loci_names)], each = 2)

# the residency markers were switched on the probe seqs file during genotype calling, let's fix this here, since none of the published results share marker names up to this point, let's leave the analysis as is and fix at this stage. 
# here the info from the corrections
# "
# R40252 and R40319 are correct.
# What I labeled R19198 is actually R14589.
# What I labeled R14589 is actually R33562.
# What I labeled R33562 is actually R24370.
# What I labeled R24370 is actually R19198.
# "
#and here we rename
residency_freqs$marker[c(3,4)] <- "OmyR14589"
residency_freqs$marker[c(1,2)] <- "OmyR33562"
residency_freqs$marker[c(7,8)] <- "OmyR24370"
residency_freqs$marker[c(5,6)] <- "OmyR19198"

#now group by marker and keep the minor allele, then convert counts to 
residency_maf <- residency_freqs %>%
  group_by(marker) %>%
  slice_min(sum) %>%
  replace(., is.na(.), 0)



#plot as a heatmap
residency_maf <- residency_maf %>%
  rename("Half-pounder" = "halfpounder", "Early-Summer" = "Summer", "Late-Summer" = "fall")
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(residency_maf[,1:4]))
colnames(tmat) <- residency_maf$marker
pheatmap(tmat)

Fall and winter fish vary (somewhat: p = 0.3 vs 0.6) at these residency alleles, with summer and halfpounder intermediate.

Let’s look at allele frequency variance at these alleles as well:

res_divs <- filter(marker_divs, marker %in% residency_loci_names)
res_divs <- pivot_wider(res_divs, names_from = obs_exp, values_from = H)

ggplot(data=res_divs)+geom_point(aes(He, Ho, color = run), size = 3, alpha = 0.8)+scale_color_viridis_d()+theme_classic()+geom_abline(aes(slope =1, intercept = 0))+xlim(0,0.5)+ylim(0,0.5)

We also know what alleles are the resident alleles. Let’s repolarize the figure above so that higher freq indicates more resident allele dosage

#we'll go row by row in the resident freqs dataframe and choose the resident allele
pol_res_freq <- residency_freqs[c(1,4,6,8,10),]


#plot as a heatmap
pol_res_freq2 <- pol_res_freq %>%
  rename("Half-pounder" = "halfpounder", "Early-Summer" = "Summer", "Late-Summer" = "fall")
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(pol_res_freq2[,1:4]))
colnames(tmat) <- pol_res_freq$marker
pheatmap(tmat)

Is there a sex-bias here?

# convert individual genotype to polarized residency allele count (more counts = more resident)

#Marker Name    Resident Allele Anadromous Allele
#OmyR14589  T   A
#OmyR19198  G   A
#OmyR24370  G   A
#OmyR33562  A   G
#OmyR40252  T   A
#OmyR40319  C   T

#fix marker names
res_genos <- genos_2.3 %>%
  select(Sample, run, year, OmyY1_2SEXY,OmyR14589, OmyR19198, OmyR24370, OmyR33562, OmyR40252) %>%
  rename( "OmyR14589" = "OmyR19198", "OmyR33562" = "OmyR14589", "OmyR24370" = "OmyR33562", "OmyR19198" = "OmyR24370")


res_genos %<>%
  mutate(OmyR14589 = case_when(OmyR14589 == "TT" ~ 2,
                               OmyR14589 == "AA" ~ 0,
                               OmyR14589 == "TA" ~ 1)) %>%
  mutate(OmyR33562 = case_when(OmyR33562 == "AA" ~ 2,
                               OmyR33562 == "GG" ~ 0,
                               OmyR33562 == "AG" ~ 1)) %>%
  mutate(OmyR19198 = case_when(OmyR19198 == "GG" ~ 2,
                               OmyR19198 == "AA" ~ 0,
                               OmyR19198 == "GA" ~ 1)) %>%
  mutate(OmyR24370 = case_when(OmyR24370 == "GG" ~ 2,
                               OmyR24370 == "AA" ~ 0,
                               OmyR24370 == "GA" ~ 1)) %>%
  mutate(OmyR40252 = case_when(OmyR40252 == "TT" ~ 2,
                               OmyR40252 == "AA" ~ 0,
                               OmyR40252 == "TA" ~ 1))
  
res_genos %<>%
  mutate(res_sum =OmyR14589 + OmyR33562 + OmyR19198 + OmyR24370 + OmyR40252)

res_genos_complete <- res_genos %>%
  filter(OmyY1_2SEXY != "00") %>%
  filter(is.na(res_sum) == FALSE)

summary(glm(as.factor(OmyY1_2SEXY) ~ res_sum, data = res_genos_complete, family = "binomial"))
## 
## Call:
## glm(formula = as.factor(OmyY1_2SEXY) ~ res_sum, family = "binomial", 
##     data = res_genos_complete)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9702  -0.9389  -0.9083   1.4363   1.4728  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.50920    0.12953  -3.931 8.45e-05 ***
## res_sum     -0.01629    0.01883  -0.865    0.387    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1109.6  on 853  degrees of freedom
## Residual deviance: 1108.9  on 852  degrees of freedom
## AIC: 1112.9
## 
## Number of Fisher Scoring iterations: 4
ggplot(data = res_genos)+geom_boxplot(aes(x = OmyY1_2SEXY, y = res_sum))
## Warning: Removed 50 rows containing non-finite values (stat_boxplot).

So variation is high within all the runs (least so at fall run), with expected heterozygosity all the way near 0.5. Interestingly there is a dearth of heterozygotes in all but summer run, however, this is significant only for halfpounder where it is the strongest. This pattern suggests that there is some structuring with the halfpounders and to a lesser extent fall run (but not summer and winter run fish), with reduced interbreeding between individuals expressing alternative residency alleles. interpreting the differences across populations suggests there’s varying degrees of gene flow between anadromous and resident fish across the run types, with the highest proportion of resident alleles in the late-summer run samples. this may be explained by the temporal and spatial variation in spawning between run types. hard to say for sure without better sampling, but worth discussing.

also something to think about here is that some state there is a tendency of hatchery released smolts to residualize over the summer, winter fish are graded (i.e. precocially mature males are not released) so is it likely to be the summer hatchery fish contributing the residuals? also of note here is that our winter sample includes both hatchery and natural origin fish, while our other samples contain only natural origin, so the pattern here could be due to hatchery alleles. ultimately we need better sampling to get at these questions.

LD

Quick LD plot using r(bar)D of winter and summer runs (first plot) vs fall and half pounder (second) plot. Will come back and clean this up later. Absence of variation within some runs is causing code to break for plotting and estimating LD in a lot of packages. If we want formal LD, I think I’ll need to manually convert to a plink or vcf format and estimate D or r2 with plink/tassel or other .

chr28_genind <- genind_2.1[loc = run_timing_loci_names]

# first lets calculate LD (dartR has a great (fast) ld estimator that works right on genind files, so let's use this)
#ldreport_winter <- dartR::gl.report.ld(chr28_genind[pop = c("Summer", "Winter")], name = NULL, save = FALSE )
#ldreport_fall <- dartR::gl.report.ld(chr28_genind[pop = c("fall", "halfpounder")], name = NULL, save = FALSE )

#this command is breaking when running on the subset datasets (no variance to do the calcs on), lets settle on poppr's tools for now (eventually will need a better LD estimate)

invisible(ldr <- poppr::pair.ia(chr28_genind[pop = c("Summer", "Winter")], limits = c(-0.1, 1.1)))

invisible(ldr_f <- poppr::pair.ia(chr28_genind[pop = c("fall", "halfpounder")], limits = c(-0.1, 1.1)))

# reorder winter / summer
ldr_f <- rownames_to_column(as.data.frame(ldr_f), var = "rowid")
ldr_f <- ldr_f %>%
  separate(rowid, into = c("snp1", "snp2"), sep = ":")

ldr_f$snp1 <- factor(ldr_f$snp1, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)
ldr_f$snp2 <- factor(ldr_f$snp2, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)

ldr_f_opposite_tri <- ldr_f
colnames(ldr_f_opposite_tri) <- c("snp2", "snp1", "rbarD")
ldr_f <- ldr_f %>%
  bind_rows(ldr_f_opposite_tri)

ggplot(ldr_f)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_gradient(low = "#bdc3c7", high= "#2c3e50")

# reorder fall_halfpounder
ldr <- rownames_to_column(as.data.frame(ldr), var = "rowid")
ldr <- ldr %>%
  separate(rowid, into = c("snp1", "snp2"), sep = ":")

ldr$snp1 <- factor(ldr$snp1, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)
ldr$snp2 <- factor(ldr$snp2, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)

ldr_opposite_tri <- ldr
colnames(ldr_opposite_tri) <- c("snp2", "snp1", "rbarD")
ldr <- ldr %>%
  bind_rows(ldr_opposite_tri)

ggplot(ldr)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_gradient(low = "#bdc3c7", high= "#2c3e50")

now let’s do the same but skip the uninformative markers for better visualization

uninf_markers <- c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")

ldr <- ldr %>%
  filter(!(snp1 %in%  uninf_markers)) %>%
  filter(!(snp2 %in%  uninf_markers))


ggplot(ldr)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_viridis_c( limits = c(0,1))

ldr_f <- ldr_f %>%
  filter(!(snp1 %in%  uninf_markers)) %>%
  filter(!(snp2 %in%  uninf_markers))

ggplot(ldr_f)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_viridis_c(limits = c(0,1))

Yes, LD is weaker among fall and half pounder than winter/summer.

LD is also about equally strong over the entire length of the region, there are no clear ld blocks within the region.

Omy05
Same but for residency markers.

res_genind <- genind_2.1[loc = residency_loci_names]

# first lets calculate LD (dartR has a great (fast) ld estimator that works right on genind files, so let's use this)
ldreport_winter <- dartR::gl.report.ld(dartR::gi2gl(res_genind[pop = c("Summer", "Winter")], verbose = 0), name = NULL, save = FALSE, verbose = 0 )
##   Processing a SNP dataset
## Start conversion....
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using >save(object, file="object.rdata")
ldreport_fall <- dartR::gl.report.ld(dartR::gi2gl(res_genind[pop = c("fall", "halfpounder")], verbose = 0), name = NULL, save = FALSE, verbose = 0 )
##   Processing a SNP dataset
## Start conversion....
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using >save(object, file="object.rdata")
ldreport_half <- dartR::gl.report.ld(dartR::gi2gl(res_genind[pop = c("halfpounder")], verbose = 0), name = NULL, save = FALSE, verbose = 0 )
##   Processing a SNP dataset
## Start conversion....
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using >save(object, file="object.rdata")
ldreport_total <- dartR::gl.report.ld(dartR::gi2gl(res_genind, verbose = 0), name = NULL, save = FALSE, verbose = 0 )
##   Processing a SNP dataset
## Start conversion....
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using >save(object, file="object.rdata")
#ldreport <- dartR::gl.report.ld(dartR::gi2gl(genind_2.1, verbose = 0), name = NULL, save = FALSE, nchunks = 2, ncores = 3, chunkname = NULL, verbose = 0)

R2

The approach above uses r(bar)D, for convenience, but we should probably use a more appropriate measure of LD and be consistent with the filtering approach we used for structure. Let’s take the time to properly plot r2

ldreport_sw <- dartR::gl.report.ld(dartR::gi2gl(chr28_genind[pop = c("Summer", "Winter")], verbose = 0), name = NULL, save = FALSE, verbose = 0 )
##   Processing a SNP dataset
## Start conversion....
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using >save(object, file="object.rdata")
#now we need to add loci names back to this report
name_key <- data.frame(names(chr28_genind$loc.n.all), c(1:12))
colnames(name_key) <- c("marker", "id")

ldreport_sw %<>%
  left_join(name_key, by = c("loc1"="id")) %>%
  rename(marker_1 = marker) %>%
  left_join(name_key, by = c("loc2"="id")) %>%
  rename(marker_2 = marker) 

#now get Omy28 marker info


ldreport_sw %<>%
  left_join(select(marker_mapping2, marker, CRITFC_SNP_pos_genome), by = c("marker_1" = "marker")) %>%
  rename(marker1_position = CRITFC_SNP_pos_genome) %>%
  left_join(select(marker_mapping2, marker, CRITFC_SNP_pos_genome), by = c("marker_2" = "marker")) %>%
  rename(marker2_position = CRITFC_SNP_pos_genome) %>%
  mutate(marker_1 = fct_reorder(marker_1, marker1_position)) %>%
  mutate(marker_2 = fct_reorder(marker_2, marker2_position))

#mapping failed for one marker, let's put it in manually
ldreport_sw %<>%
  mutate(marker1_position = case_when(marker_1 == "Chr28_11702210" ~ "11702210",
                                      TRUE ~ marker1_position)) %>%
  mutate(marker2_position = case_when(marker_2 == "Chr28_11702210" ~ "11702210",
                                      TRUE ~ marker2_position))


# some markers are on the wrong side of the diagonal, let's print both sides
ldreport_sw_rev <- ldreport_sw
ldreport_sw_rev[, c("marker_1", "marker_2", "marker1_position", "marker2_position")] <- ldreport_sw_rev[, c("marker_2", "marker_1", "marker2_position", "marker1_position")]

ldreport_sw <- rbind(ldreport_sw, ldreport_sw_rev)


ldreport_sw %<>%
  mutate(marker_1 = fct_reorder(marker_1, marker1_position)) %>%
  mutate(marker_2 = fct_reorder(marker_2, marker2_position))


#half and fall
ldreport_hf <- dartR::gl.report.ld(dartR::gi2gl(chr28_genind[pop = c("fall", "halfpounder")], verbose = 0), name = NULL, save = FALSE, verbose = 0 )
##   Processing a SNP dataset
## Start conversion....
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using >save(object, file="object.rdata")
#now we need to add loci names back to this report
name_key <- data.frame(names(chr28_genind$loc.n.all), c(1:12))
colnames(name_key) <- c("marker", "id")

ldreport_hf %<>%
  left_join(name_key, by = c("loc1"="id")) %>%
  rename(marker_1 = marker) %>%
  left_join(name_key, by = c("loc2"="id")) %>%
  rename(marker_2 = marker) 

#now get Omy28 marker info


ldreport_hf %<>%
  left_join(select(marker_mapping2, marker, CRITFC_SNP_pos_genome), by = c("marker_1" = "marker")) %>%
  rename(marker1_position = CRITFC_SNP_pos_genome) %>%
  left_join(select(marker_mapping2, marker, CRITFC_SNP_pos_genome), by = c("marker_2" = "marker")) %>%
  rename(marker2_position = CRITFC_SNP_pos_genome) %>%
  mutate(marker_1 = fct_reorder(marker_1, marker1_position)) %>%
  mutate(marker_2 = fct_reorder(marker_2, marker2_position))

#mapping failed for one marker, let's put it in manually
ldreport_hf %<>%
  mutate(marker1_position = case_when(marker_1 == "Chr28_11702210" ~ "11702210",
                                      TRUE ~ marker1_position)) %>%
  mutate(marker2_position = case_when(marker_2 == "Chr28_11702210" ~ "11702210",
                                      TRUE ~ marker2_position))


# some markers are on the wrong side of the diagonal, let's print both sides
ldreport_hf_rev <- ldreport_hf
ldreport_hf_rev[, c("marker_1", "marker_2", "marker1_position", "marker2_position")] <- ldreport_hf_rev[, c("marker_2", "marker_1", "marker2_position", "marker1_position")]

ldreport_hf <- rbind(ldreport_hf, ldreport_hf_rev)


ldreport_hf %<>%
  mutate(marker_1 = fct_reorder(marker_1, marker1_position)) %>%
  mutate(marker_2 = fct_reorder(marker_2, marker2_position))

ggplot(data = filter(ldreport_sw ))+geom_tile(aes(marker_1, marker_2, fill = R2), size = 2)+scale_fill_viridis_c(option = "C")+theme_classic()+theme(axis.text.x = element_text(angle = 90))+ggtitle("Early-Summer\nand Winter Run")+coord_equal() 

ggplot(data = filter(ldreport_hf ))+geom_tile(aes(marker_1, marker_2, fill = R2), size = 2)+scale_fill_viridis_c(option = "C")+theme_classic()+theme(axis.text.x = element_text(angle = 90))+ggtitle("Late-Summer\nand Half-pounders")+coord_equal()

# 
# #plot
# a <- ggplot(data = filter(ldreport_sw ))+geom_tile(aes(marker_1, marker_2, fill = R2), size = 2)+scale_fill_viridis_c(option = "C")+theme_classic()+theme(axis.text.x = element_text(angle = 90))+ggtitle("Early-Summer\nand Winter Run")+coord_equal() 
# 
# b <- ggplot(data = filter(ldreport_hf ))+geom_tile(aes(marker_1, marker_2, fill = R2), size = 2)+scale_fill_viridis_c(option = "C")+theme_classic()+theme(axis.text.x = element_text(angle = 90))+ggtitle("Late-Summer\nand Half-pounders")+coord_equal()
# 
# plot_grid(a,b,ncol=1,labels="auto")

haplotypes

The identification of “discordant” genotypes at markers in the region on chromosome 28 associated with run timing (i.e. winter and summer run fish demonstrate near perfect linkage among markers in this region, but fall and halfpounders demonstrate reduced LD), suggests that recombination can occur within this region. If we can phase our data we should be able to identify chr28 haplotypes to gain more inference into the evolutionary origin/maintenance of fall/halfpounders.

Outline:
- identify snps in linkage around chr28 and chr05
- phase genotypes for these snps
- cluster phased genos to identify haplogroups and/or build haplotype network to visualize relationships and visually identify clusters

phasing

First we need to get phased genotypic data in order to identify haplotypes.

We will use fastPHASE. First lets get data in the fastphase format

#when I reran the analysis after removing the redudant marker and the marker with population specific missingness, instead of making these input files from scratch I simply deleted the rows from the input file for fastphase. 

# the fast phase format's genotype data for an individual is expressed on two rows with each row representing an unphased haploid genotype. There is also some metadata added into header rows. the main gt matrix  is very similar to the structure format so we'll use the same approach to reformat the data as we used to convert from genind to structure

chr28_genind <- genind_2.1[loc=run_timing_loci_names]

#note just sort of crashed through this with a text editor, not easily logged, but the general idea was transpose the data, split columns (diploid to dual haploid by adding a tab between the values), then convert N to ?, then read the result into R and transpose again

df <- genind2df(chr28_genind)

#reorder to chromosomal order here
#first, order the markers according to genomic position
map_results <- readxl::read_xlsx("metadata/final_mapping_results.xlsx", sheet = 1)
map_results <- map_results %>%
  mutate(marker = str_replace(marker, "Omy28", "Chr28"))

marker_pos <- map_results %>%
  select(marker, CRITFC_SNP_pos_genome) %>%
  filter(marker %in% colnames(df))

#hmm one marker position is missing, just add it manually
marker_pos$CRITFC_SNP_pos_genome <- as.numeric(marker_pos$CRITFC_SNP_pos_genome)
marker_pos[12,2] <- 11702210

#order to columns
df <- df %>%
  select(unlist(c(marker_pos[order(marker_pos$CRITFC_SNP_pos_genome),1]), use.names = FALSE))

# now transponse and save
df <- as.data.frame(t(df))
write_tsv(df, "genotype_data_2021/chr28.tmp")

#in text editor, convert NA to ??, then split columns
df <- read_tsv("genotype_data_2021/chr28.tmp", col_names = FALSE)
df <- t(df)
write_tsv(as.data.frame(df), "genotype_data_2021/chr28.fastphase", col_names = FALSE)

#next we remove whitespaces from the genos and label the individuals like so:
# #individual 1 name
# TAGCG...
# TAGCC...
# #individual 2 name
# TAGCG...
# TAGCC...

# this was accomplished by removing all the whitespace in the genos then using the following regex in find and replace
# find: ^(\w+)\t([A-Z?]+)\n^(\w+)\t([A-Z?]+)
# replace: # \1\n\2\n\4

#then we added the header info according to the fastphase manual
# snp pos was taken from the critfc mapping data and manually added to the file

fastphase was run using default setting and attempting to minimize the switch error (Add more details on this for eventual methods)

 ~/Science/programs/fastPHASE  -n ./genotype_data_2021/chr28.fastphase

haplotype network

Here we will examine the diversity and putative evolutionary history of the haplotypes.

ph_geno <- read_tsv("phasing/2021/phased_genos_for_r.txt")

First lets collect some summary info:

ph_geno <- genos_2.3 %>%
  select(Sample, run) %>%
  left_join(ph_geno, by = c("Sample" = "ind"))

ph_geno %>%
  group_by(run) %>% 
  summarise(count = n_distinct(hap))
#write_tsv(ph_geno, "phasing/phased_genos_run.txt")

There are 2 unique haplotypes in summer sample, 9 in winter, 40 in fall and 52 in halfpounders.

We constructed a haplotype network using the minimum spanning approach in Popart v1.7 (also tried to use pegas, but data import and conversion kept unphasing the data or and scrambling the sample ids across the different haplotypes)

Below is the minimum spanning network of the 84 different haplotypes among greb1l region SNPS inferred using fastphase and popart (msn). Log in code chunk below

skipped this in later runs of the analysis because we are really only inteerested in the haplotypes inferred at the informative markers

knitr::include_graphics('phasing/phased_genos2.phylip.svg')

# log

# this network was produced outside of r using popart 1.7. briefly, raw phased genotypes were converted to a phylip format (./phasing/phased_genos2.phylip) and run type was formatted as the popart trait data format (see ./phasing/traits2.txt), then we imported both, set the colors according to the viridis pallete used throughout the notebook, and ran the minimum spanning network algorithm to infer a graph/network
### DID NOT USE THIS ###

# didn't use any of this because row ids and haplotype assignments keep getting scrambled and producing misleading results

phased_loci <- read.loci("phasing/phased_genos_run.txt", header = FALSE, col.pop = 1, col.loci = 2:15)
haps <- haplotype(phased_loci, check.phase = TRUE, )
net <- haploNet(haps)

# structure -> df -> haplotype
phased_loci <- read.structure("phasing/phased_genos_run.stru", n.ind = 882, n.loc = 14, onerowperind = FALSE, col.pop = 1, col.lab = 0, col.others = 0, row.marknames = 0)
#phased_loci <- alleles2loci(phased_loci$tab,  phased = TRUE)
df <- genind2df(phased_loci)
h <- haplotype(as.matrix(df))
hn <- haploNet(h)
plot(hn, fast = FALSE)
#seems like phasing is getting lost in here somewhere because there are only 88 unique haps in the raw data

# this approach (below) works, but it keeps the haplotypes as characters making the distance calculation inappropriate
phased_loci <- read_tsv("phasing/phased_genos_run.txt")
hap <- haplotype(as.matrix(phased_loci[,-1]))
hap <- sort(hap, what = "labels")
dst <- dist.dna(hap)
dst2 <- dist
tree <- rmst(dst)

ind.hap<-with(
    stack(setNames(attr(hap, "index"), row.names(hap))), 
    table(hap=ind, pop=phased_loci$pop)
)

plot(tree, size=attr(tree, "weight"), scale.ratio = 10, labels= FALSE)

plot(tree, size=attr(tree, "weight"), scale.ratio = 100, cex = 0.8, pie=ind.hap, labels = F)
legend(50,50,colnames(ind.hap), col=rainbow(ncol(ind.hap)), pch=20)

#
hn2 <- haploNet(hap)

countHap <- function(hapl = hap, dna = phased_loci){
    with(
        stack(setNames(attr(hapl, "index"), rownames(hapl))),
        table(hapl = ind, pop = dna$pop)
    )
}


countHap()
plot(hn2, pie = countHap(), size = attr(hn2, "freq"), labels= F, scale.ratio = 100 )
legend(50,50,colnames(countHap()), col=rainbow(ncol(countHap())), pch=20)

haplotype plots

Here we make a couple rough plots to see if the haplotypes make sense.

#added column names to ph_geno dataframe to reflect SNPs

phased_genos <- read_tsv("phasing/2021/phased_genos_run.txt")

#polarize to winter (find the most common allele in winter convert to 1, then convert alternative allele to 0)

#gather the alleles for winter
phased_genos %>%
  filter(run == "Winter") %>%
  count(Chr28_11667578, Omy_GREB1_09, Chr28_11607954, Omy_GREB1_05, Chr28_11625241, Chr28_11632591, Chr28_11658853, `Omy_RAD15709-53`, Chr28_11676622, Chr28_11683204, Chr28_11773194, Chr28_11702210) %>%
  slice(which.max(n))
#convert to polarized counts, just did this manually because there are only 14 of them and it was faster (sorry for the hardcoding later David...)
phased_genos <- phased_genos %>%
  mutate(Chr28_11667578 = if_else(Chr28_11667578 == "C", 1, 0)) %>%
  mutate(Omy_GREB1_09 = if_else(Omy_GREB1_09 == "G", 1, 0)) %>%
  mutate(Chr28_11607954 = if_else(Chr28_11607954 == "G", 1, 0)) %>%
  mutate(Omy_GREB1_05 = if_else(Omy_GREB1_05 == "G", 1, 0)) %>%
  mutate(Chr28_11625241 = if_else(Chr28_11625241 == "G", 1, 0)) %>%
  mutate(Chr28_11632591 = if_else(Chr28_11632591 == "G", 1, 0)) %>%
  mutate(Chr28_11658853 = if_else(Chr28_11658853 == "C", 1, 0)) %>%
  mutate(`Omy_RAD15709-53` = if_else(`Omy_RAD15709-53` == "G", 1, 0)) %>%
  mutate(Chr28_11676622 = if_else(Chr28_11676622 == "G", 1, 0)) %>%
  mutate(Chr28_11683204 = if_else(Chr28_11683204 == "T", 1, 0)) %>%
  mutate(Chr28_11773194 = if_else(Chr28_11773194 == "A", 1, 0)) %>%
  mutate(Chr28_11702210 = if_else(Chr28_11702210 == "G", 1, 0)) 

#lets get rid of the markers that are not diagnostic or heaviliy weighted in the a priori DAPC

phased_genos <- phased_genos %>%
 select(-one_of(c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")))

#lets get rid of the "problem markers"
phased_genos <- phased_genos %>%
 select(-one_of(c("Omy_RAD47080-54", "Chr28_11671116")))



winter_pac <- filter(phased_genos, run == "Winter")
summer_pac <- filter(phased_genos, run == "Summer")
fall_pac <- filter(phased_genos, run == "fall")
halfpounder_pac <- filter(phased_genos, run == "halfpounder")

a = pheatmap(winter_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "winter")

# b = pheatmap(summer_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "Summer") # this wont plot as there's no variation
c = pheatmap(fall_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")

d = pheatmap(halfpounder_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")

#now without the uninformative markers (i.e. only keep diagnostic + greb05)
#skipped this and made better figures in the next code chunk

# 
# phased_genos <- phased_genos %>%
#  select(-one_of(c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")))
# 
# winter_pac <- filter(phased_genos, run == "Winter")
# summer_pac <- filter(phased_genos, run == "Summer")
# fall_pac <- filter(phased_genos, run == "fall")
# halfpounder_pac <- filter(phased_genos, run == "halfpounder")
# 
# #note that b won't run now because there is no variation
# a = pheatmap(winter_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "winter")
# #b = pheatmap(summer_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "summer")
# c = pheatmap(fall_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")
# d = pheatmap(halfpounder_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")

Next let’s make some publication quality figures that more clearly express what is going on here. Goals: separate run types, within fall and half pounders separate into assignments groups, so that we can examine what types of haplotypes contribute to each, also within each plot make sure it is clustered

Plot below shows indiviudal haplotypes (from only informative loci) in the greb1L/rock1 region, haplotypes (rows) are hierarchically clustered, individual snps contributing to haplotypes (columns) are in genomic order, color is polarized so early summer is purple and winter is yellow.

#prepackaged heatmap tools are unlikely to work well given all the weird things we want to do, so lets just build our own ggplot output group by group (winter + summer + 3 each of fall and halfpounder)

#remove uninformative snps
#phased_genos <- phased_genos %>%
#  select(-one_of(c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")))


#get assignment results


# first get the assignments
ld1_2 <- as.data.frame(dapc_full_apriori$ind.coord) %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)

# assignments using original DAPC
CIs <- ld1_2 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
assn <- ld1_2 %>%
  mutate(assignment = case_when(
    LD1 < CIs$hiCI[4]  ~ "winter", 
    LD1 > CIs$loCI[3] ~ "summer", 
    LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4] ~ "unassigned")) %>%
  select(sample, assignment)


#split the halfpounders and fall into groups by assignment
half_fall_ph <- phased_genos %>%
  filter(run == "halfpounder" | run == "fall") %>%
  left_join(assn)

fall_w <- half_fall_ph %>%
  filter(run == "fall" & assignment == "winter")

fall_s <- half_fall_ph %>%
  filter(run == "fall" & assignment == "summer")

fall_u <- half_fall_ph %>%
  filter(run == "fall" & assignment == "unassigned")

halfpounder_w <- half_fall_ph %>%
  filter(run == "halfpounder" & assignment == "winter")

halfpounder_s <- half_fall_ph %>%
  filter(run == "halfpounder" & assignment == "summer")

halfpounder_u <- half_fall_ph %>%
  filter(run == "halfpounder" & assignment == "unassigned")

winter_pac <- phased_genos %>%
  filter(run == "Winter")

summer_pac <- phased_genos %>%
  filter(run == "Summer")

# get by group clustering results
require(ggdendro)
dendro_winter <- as.dendrogram(hclust(d = dist(winter_pac[,-c(1,2)])))
dendro_summer <- as.dendrogram(hclust(d = dist(summer_pac[,-c(1,2)])))
dendro_fall_w <- as.dendrogram(hclust(d = dist(fall_w[,-c(1,2,11)])))
dendro_fall_s <- as.dendrogram(hclust(d = dist(fall_s[,-c(1,2,11)])))
dendro_fall_u <- as.dendrogram(hclust(d = dist(fall_u[,-c(1,2,11)])))
dendro_halfpounder_w <- as.dendrogram(hclust(d = dist(halfpounder_w[,-c(1,2,11)])))
dendro_halfpounder_s <- as.dendrogram(hclust(d = dist(halfpounder_s[,-c(1,2,11)])))
dendro_halfpounder_u <- as.dendrogram(hclust(d = dist(halfpounder_u[,-c(1,2,11)])))

#add row ids for ordering later
winter_pac <- rowid_to_column(winter_pac, "row_id")
summer_pac <- rowid_to_column(summer_pac, "row_id")
fall_w <- rowid_to_column(fall_w, "row_id")
fall_s <- rowid_to_column(fall_s, "row_id")
fall_u <- rowid_to_column(fall_u, "row_id")
halfpounder_w <- rowid_to_column(halfpounder_w, "row_id")
halfpounder_s <- rowid_to_column(halfpounder_s, "row_id")
halfpounder_u <- rowid_to_column(halfpounder_u, "row_id")

#############
# convert to long format, retain dendrogram order and snp order
##############

long_winter <- winter_pac %>%
  pivot_longer(cols = !(c("run", "sample", "row_id")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(winter_pac$row_id[order.dendrogram(dendro_winter)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_summer <- summer_pac %>%
  pivot_longer(cols = !(c("run", "sample", "row_id")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(summer_pac$row_id[order.dendrogram(dendro_summer)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_fall_w <- fall_w %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(fall_w$row_id[order.dendrogram(dendro_fall_w)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_fall_s <- fall_s %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(fall_s$row_id[order.dendrogram(dendro_fall_s)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53",  "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_fall_u <- fall_u %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(fall_u$row_id[order.dendrogram(dendro_fall_u)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578",  "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_half_u <- halfpounder_u %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(halfpounder_u$row_id[order.dendrogram(dendro_halfpounder_u)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_half_w <- halfpounder_w %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(halfpounder_w$row_id[order.dendrogram(dendro_halfpounder_w)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_half_s <- halfpounder_s %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(halfpounder_s$row_id[order.dendrogram(dendro_halfpounder_s)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD15709-53", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

############
#build heatmaps
############

heatmap_w <- ggplot(data = filter(long_winter), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Winter\n ")+xlab(element_blank())

heatmap_s <- ggplot(data = filter(long_summer), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+theme_classic()+scale_fill_gradient(low = "#440154FF", high = "#440154FF")+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Early-Summer\n ")+xlab(element_blank())

heatmap_fw <- ggplot(data = filter(long_fall_w), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Late-Summer\nWinter Assigned")+xlab(element_blank())

heatmap_fs <- ggplot(data = filter(long_fall_s), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Late-Summer\nEarly-Summer ssigned")+xlab(element_blank())

heatmap_fu <- ggplot(data = filter(long_fall_u), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Late-Summer\nUnassigned")+xlab(element_blank())

heatmap_hw <- ggplot(data = filter(long_half_w), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Halfpounder\nWinter Assigned")+xlab(element_blank())

heatmap_hs <- ggplot(data = filter(long_half_s), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Halfpounder\nSummer Assigned")+xlab(element_blank())

heatmap_hu <- ggplot(data = filter(long_half_u), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme(axis.text.x = element_text(angle = 90), axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none")+ylab("Halfpounder\nUnassigned")


#make the final plot
require(gridExtra)
grid.arrange(heatmap_s, heatmap_w, heatmap_fs, heatmap_fw, heatmap_fu, heatmap_hs, heatmap_hw, heatmap_hu,  ncol = 1, heights = c(3,3,3,3,3,3,3,7), clip = FALSE)

haplotype network 2

Let’s also build a haplotype network only for the informative markers.

This involves removing the uninformative markers from the .phylip file and running popart again (See above)

Let’s also collect the same stats again

#here to remove the two unwanted markers deleted the corresponding columns in a texteditor, they are SNPs 1,4,5 and 12
# for example
#find string: \t(\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)$
#replace string: \t\2\3\6\7\8\9\10\11

#ph_geno <- read_tsv("phasing/phased_genos3_r.txt")
ph_geno <- read_tsv("phasing/2021/phased_genos2_r.txt")


ph_geno <- genos_2.3 %>%
  select(Sample, run) %>%
  left_join(ph_geno, by = c("Sample" = "ind"))

ph_geno %>%
  group_by(run) %>% 
  summarise(count = n_distinct(hap))
#did the same as above (removed the bad columns) from phased_genos.phylip and ran again

#ran popart on phased_genos4.phylip and traits2.txt
knitr::include_graphics('phasing/phased_genos4.phylip.svg')

more recombination in half or fall mean shortest distance to node on haplotype network containing known winter or early-summer haplotype

#popart output is very limited, just a list of members of each node, with the node named after the first member. to figure out the statistic we're interested in we'll have to do some manual work here

#for each named node, manually trace the shortest distance to a node with a early- or late-migrating haplotype

#then assign this value to all the members in the nodes

#then collect summary statistics on these per population (mean + sd + plot distribution)

early-summer: 0 winter: 0 late-summer

summary

The density of markers in greb1L/rock1 region in our GTseq dataset allows us to identify characteristic winter and early-summer run haplotypes from our samples. if we focus on only the sites that vary strongly between winter and summer runs, there is a single summer haplotype (note there are multiple haps within this group when considering all sites, not just the diagnostic + heavily weighted ones), and a handful (5?) haplotypes within winter run.

summer and winter assigned halfpounder and fall run fish possess largely winter and summer run haplotypes. unassigned halfpounders and fall run fish also possess 1 copy each of winter and summer run haplotypes, suggesting many unassigned fish are first generation hybrids between winter and summer runs, however, there are also many highly recombined haplotypes among the unassigned halfpounders and fall run fish. evidence of recombination within the greb1L/rock1 region haplotypes suggests gene flow between winter and summer run fish occurs beyond first generation hybrids. however we did not observe any recombined greb1L/rock1 region haplotypes within our sample of winter and summer run fish.

this opens up a question, what spawning events allow for recombination? we propose that there may be partial temporal isolation of early-summer from winter run fish, with fall run adults serving as a bridge between the extreme distribution of run timing. indeed we also discovered a weak but significant association between DAPC DA1 score and arrival timing among fall run fish.

this absence of recombined haplotypes within winter and summer samples may reflect extreme phenotype sampling (adult samples were identified as winter and summer runs by timing of arrival at spawning grounds, taken on far ends of distribution (late winters vs early early-summers))

this also calls into question what phenotype we are looking at, runs are named based on their timing of freshwater entry, but appear to also segregate at least weakly on the basis of spawning time/arrival at spawning grounds.

to continue thinking about differences between halfpounder and fall haplotype results?
where do halfpounders come in here?

Discussion

note that this reflects an initial discussion at the point of time of conducting the initial analysis logged here and is NOT the authors’ final interpretation of the results, which is presented in the manuscript

Two multivariate approaches (unconstrained ordination (PCA) and constrained ordination (DAPC)) as well as a model-based bayesian clustering approach (STRUCTURE) were used to interrogate population genetic structure among 3 runs of adult Rogue River steelhead as well as half-pounders. We also estimated genetic distances among run types using Fst. These approaches attempt to identify structure either (a) among a priori assigned grouping of organisms (in our case run timing), or (b) among de novo identified genetic clusters. We draw inferences on population genetic structure within the Rogue River based on the strong agreement among multivariate and bayesian as well as de novo and a priori approaches to characterizing population genetic structure. However, it is important to note that our results are based on the small portion of total genetic variation represented by our GTseq panel markers, and that these markers do not represent an unbiased representation of genetic variation among Rogue River steelhead. For example, “neutral” markers in our panel were chosen to capture genetic variation at wide spatial scales and not within individual drainages, so our study may underestimate the extent of population differentiation at neutral markers. Similarly, the inclusion of multiple markers associated with run-timing phenotypes allows us to capture strong differentiation at these genomic regions, but these regions are strong outliers across many population genetic parameters, are likely over represented in our dataset relative to neutral markers and generally will overestimate genome-wide genetic differentiation. Most critically, our dataset may exclude genetic variants that drive adaptive phenotypic variation within the Rogue River. As many of our approaches utilize different subsets of markers, and differ in their sensitivity to various biases in the dataset, it is important to consider each in light of its respective caveats and underlying assumptions.

No Difference Among Late-Summer Adults and Half-Pounders
While there is evidence of structure within half pounders and late-summer run steelhead (see discussion below), we did not find any evidence to suggest restricted gene flow among these two groups. Genetic differentiation (Fst) was low at both neutral markers and on average across the full GTseq dataset and there is little distance among group centroids in both the neutral and full PCA. STRUCTURE results suggest that half-pounders and late summer run fish derive similar proportions of their ancestry from different ancestral genetic clusters, regardless of the number of genetic clusters being modeled. Additionally, individuals within each group demonstrate similar variance in this metric. We also failed to discriminate among half-pounder and late summer run fish using a discriminant analysis of principal components of genetic variation in our dataset.

Structure Among Winter and Early-Summer Run Steelhead
In contrast, we identify population genetic structure among winter and early-summer run steelhead at both neutral and adaptive genetic markers. Using markers annotated as neutral in our GTseq dataset, total differentiation is low (Fst ~ 1%), and there is little to no differentiation apparent in the PCA, however, this low level of differentiation is sufficient to clearly discriminate among winter and summer run fish using DAPC.

When using the full dataset, structure among winter and early summer run fish is strong. When 3 or more ancestral genetic clusters are modeled, STRUCTURE estimates alternative genetic clusters that dominate the ancestry of winter and early summer run steelhead, although all individuals demonstrate some degree of admixture with the alternative population’s dominant cluster. Using DAPC, we are able to describe a single axis of genetic variation that strongly discriminate among winter and early-summer run fish. This axis is primarily driven by variation at markers with known associations to run-timing that are diagnostic or nearly diagnostic among winter and early-summer run steelehad, and, to a lesser extent, several neutral annotated markers and markers with known associations to residency vs anadromy.

Structure within Late-Summer and Half-Pounders
The matter of separating late-summer and half-pounder steelhead from early-summer and winter runs is less clear. While all analyses that identified structure within the Rogue River steelhead place half-pounder and late-summer run steelhead as intermediate between winter and early-summer steelhead at the group-level, there is (a) substantial overlap in genetic variation among both half-pounder and fall run steelhead with winter and early summer run steelhead and (b) evidence of population structure within half-pounder and late-summer run steelhead.

Inference of structure within both fall and half-pounder fish comes from several lines of evidence. Among half-pounders, and to a lesser extent late-summer run steelhead, many markers were out of HWE with a significant excess of homozygosity, suggestive of Wahlund effects. These markers were significantly enriched for run-timing associated markers, suggesting that structure within half-pounders is due in part to groups of either early-summer-like or winter-like individuals. Also, de novo identification of population genetic structure via k means clustering always separated late-summer and half-pounder individuals into genetic clusters with both late-summer and winter run adults, regardless of the number of clusters used. Finally, unlike k-means clustering, in STRUCTURE, individuals may derive their ancestry from multiple clusters, and we see that while there are no strong differences at the group level in ancestry, there is a high level of variation within each group.

Indeed, when we attempt to classify both half-pounder and late-summer run fish as winter-like or early-summer like using a linear combination of alleles that strongly discriminate winter and early summer run fish (discriminant axis 1), we observe a group of strongly winter-like fish, a second group of intermeditate fish and a small number of early-summer-like fish. Direct examination of genotypes at known-run timing markers reveals the same pattern. One caveat here that might be worth including in anything published for a wider audience: Methods like the DAPC, and even the association analyses that are the basis of the annotations for our panel, operate under an assumption of additivity. Without a clear, mechaninistic understanding of the g -> p map for run timing, we can not predict run-timing phenotypes from such heterozygous and discordant genotypes at run-timing associated markers. Variation in effect size among loci, dominance, epistasis, the effects of un-genotyped causal loci and plasticity all may be at play here.

add discussion about diagnostic vs highly weighted in DAPC and about difference in phenotypes used for annotation (CRITFC annotations are for interior pops and the phenotype is arrival timing on spawning grounds, whereas prince et al annotations are for freshwater entry) compare our resutls with those from the latest critfc paper and the coastal lineages

also add discussion about how some markers with known annotations are not diagnostic or not even informative, and this could have somehting to do with divergence in the summer haplotype as it spread (assuming a single origin)

In addition to structure within fall and half-pounders along the genetic axis differentiating early-summer and winter runs, we also identified structure within these groups driven by both neutral markers and adaptive markers, including all 5 markers in the dataset with residency-vs-anadromy annotations and a neutral annotated marker with strong residency-vs-anaadromy associations in the Klickitat River.

Taken together, we propose that both late-summer run and half-pounders are a mixed assemblage of individuals demonstrating genetic characteristics inclusive of winter-run, early-summer run and a highly heterozygous intermediate group. Also, despite mark-recapture evidence that early-summer and late-summer run adults are found together on the spawning grounds, genetic differentiation at GTseq panel markers suggests late-summer (and half-pounder) run adults are less differentiated from winter run than early-summer run adults at both neutral and adaptive regions of the genome.

Stray thought / hypothesis collecter

This section is a dumping ground for ideas if we want to move forward with this investigation, and stray thoughts that might warrant returning to in the future. Not edited, remove before sharing widely.

pattern within fall
we have dates for approximate freshwater entry for the fall run fish. is there a correlation in the winter-vs-summer genetic axis and date of entry? there is substantial overlap in the entry timing of fish referred to as winter and late-summer run, and substantial interannual variation in the peak of run timing within a group, taken together this suggests that some late late-summer fish could instead be early-arriving winter run fish. this would show up as a correlation between run timing genotype and arrival date with the fall fish. let’s check.

First lets fit a simple linear model

#intake files
half_2018_intake <- readxl::read_xlsx("metadata/2018_halfpounder/OmyJC18ROGR_STHP Intake form Spread sheet.xlsx", sheet = 3)
half_2019_intake <- readxl::read_xlsx("metadata/2019_halfpounder/STHP Intake form Spread sheet 2019.xlsx", sheet = 1)
fall_intake <- readxl::read_xlsx("metadata/2019_fall/Rogue Adult Summer and Winter (06 11 20).xlsx", sheet = 1)
summer_intake <- readxl::read_xlsx("metadata/2019_summer/Omy Rogue2019 steelhead datasheets.xlsx", sheet = 2)
winter_intake <- readxl::read_xlsx("metadata/2019_winter/StW Scales for DNA (06 17 19).xlsx", sheet = 3)
fall_2020_intake <- readxl::read_xlsx("metadata/2020_fall/Intake_0004_OmyAC20ROGR_ProgenyEntry.xlsx")

#merge intakes
# first clean them up a bit to make merging easier
half_2018_intake <- half_2018_intake[,c(1,6)]
colnames(half_2018_intake)[1] <- "ID"
half_2018_intake$run <- "halfpounder"
half_2018_intake$year <- "2018"
colnames(half_2018_intake) <- c("ID", "Date", "run", "year")

half_2019_intake <- half_2019_intake[,c(1,2)]
half_2019_intake$run <- "halfpounder"
half_2019_intake$year <- "2019"
colnames(half_2019_intake) <- c("ID", "Date", "run", "year")

summer_intake <- summer_intake[,c(2,3,6)]
colnames(summer_intake) <- c("ID", "Date", "run")
summer_intake$year <- "2019"

winter_intake <- winter_intake[,c(2,3,7)]
colnames(winter_intake) <- c("ID", "Date", "run")
winter_intake$year <- "2019"

fall_intake <- subset(fall_intake, Run == "Summer")
fall_intake <- fall_intake[,c(1,3,6)]
colnames(fall_intake) <- c("ID", "Date", "run")
fall_intake$run <- "fall"
fall_intake$year <- "2019"

fall_2020_intake <- fall_2020_intake[,c(2,5,13)]
colnames(fall_2020_intake) <- c("ID", "Date", "run")
fall_2020_intake$run <- "fall"
fall_2020_intake$year <- "2020"

meta_data <- bind_rows(half_2018_intake, half_2019_intake, fall_intake, fall_2020_intake, winter_intake, summer_intake)


cor_data <- meta_data %>%
  right_join(ld1_2, by = c("ID" = "sample" ))



cor_data$jdate <- as.numeric(format(as.POSIXlt(cor_data$Date, format = "%y-%m-%d"), "%j"))
ggplot(cor_data[cor_data$run=="fall",])+geom_point(aes(jdate,LD1))+theme_classic()+geom_smooth(aes(jdate,LD1), method = "lm")

summary(lm(LD1~as.numeric(jdate), data=cor_data[cor_data$run=="fall",]))
## 
## Call:
## lm(formula = LD1 ~ as.numeric(jdate), data = cor_data[cor_data$run == 
##     "fall", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00554 -0.74135 -0.02741  0.66296  2.76925 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        2.287109   0.809422   2.826  0.00507 **
## as.numeric(jdate) -0.009674   0.003283  -2.947  0.00349 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8918 on 273 degrees of freedom
## Multiple R-squared:  0.03082,    Adjusted R-squared:  0.02727 
## F-statistic: 8.682 on 1 and 273 DF,  p-value: 0.003491
summary(lm(LD1~as.numeric(jdate), data=cor_data[cor_data$run=="halfpounder",]))
## 
## Call:
## lm(formula = LD1 ~ as.numeric(jdate), data = cor_data[cor_data$run == 
##     "halfpounder", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0713 -0.8508 -0.2630  0.7316  3.1514 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        2.031143   0.847776   2.396   0.0169 *
## as.numeric(jdate) -0.008369   0.003385  -2.472   0.0137 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.082 on 641 degrees of freedom
## Multiple R-squared:  0.009445,   Adjusted R-squared:  0.0079 
## F-statistic: 6.112 on 1 and 641 DF,  p-value: 0.01369
ggplot(cor_data[cor_data$run=="halfpounder",])+geom_point(aes(jdate,LD1))+theme_classic()+geom_smooth(aes(jdate,LD1), method = "lm")

cor_data %>% group_by(run) %>%
  summarise(r = range(jdate))
#figure demonstrating why similar analysis can not be done in half-pounders (year confounded with date)
ggplot(data = filter(cor_data, pop == "halfpounder"))+geom_histogram(aes(x=as.numeric(jdate), color = year))

ggplot(data = filter(cor_data, pop == "fall"))+geom_histogram(aes(x=as.numeric(jdate), color = year))

There is a significant, but minor trend (r2 = 0.03, p = 0.04) towards more genetically winter-like fish over time among fall run fish.

I have multiple years now, and there may be year to year variation in how the genetics at play affects the migration timing. A better approach may be linear mixed model, where year is fit as a random effect (we are interested in the variance among years, not it’s fixed effect on the mean arrival date)

fall_mixed <- lme4::lmer(jdate ~ LD1 + (1|year), data = filter(cor_data, run =="fall"))
fall_reduced <- lme4::lmer(jdate ~ 1 + (1|year), data = filter(cor_data, run =="fall"))

anova(fall_reduced, fall_mixed)
## refitting model(s) with ML (instead of REML)
summary(fall_mixed)
## Linear mixed model fit by REML ['lmerMod']
## Formula: jdate ~ LD1 + (1 | year)
##    Data: filter(cor_data, run == "fall")
## 
## REML criterion at convergence: 2237.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3569 -0.4556  0.0537  0.4225  3.6771 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  year     (Intercept) 124      11.14   
##  Residual             201      14.18   
## Number of obs: 275, groups:  year, 2
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 244.5472     7.9221  30.869
## LD1          -3.7013     0.9488  -3.901
## 
## Correlation of Fixed Effects:
##     (Intr)
## LD1 0.012
ggplot(data = filter(cor_data, run =="fall"))+geom_point(aes(LD1, jdate, color = year))+geom_smooth(aes(LD1, jdate, color = year), method = "lm")
## `geom_smooth()` using formula 'y ~ x'

lme4::ranef(fall_mixed)
## $year
##      (Intercept)
## 2019    7.827215
## 2020   -7.827215
## 
## with conditional variances for "year"
lme4::fixef(fall_mixed)
## (Intercept)         LD1 
##  244.547152   -3.701266
require(ggeffects)
## Loading required package: ggeffects
## Warning: package 'ggeffects' was built under R version 3.6.2
## 
## Attaching package: 'ggeffects'
## The following object is masked from 'package:cowplot':
## 
##     get_title
# Extract the prediction data frame
pred.mm <- ggpredict(fall_mixed, terms = c("LD1"))  # this gives overall predictions for the model

# Plot the predictions 

ggplot(pred.mm) + 
   geom_line(aes(x = x, y = predicted)) +          # slope
   geom_ribbon(aes(x = x, ymin = predicted - std.error, ymax = predicted + std.error), 
               fill = "lightgrey", alpha = 0.5) +  # error band
   geom_point(data = filter(cor_data, run =="fall"),                      # adding the raw data (scaled values)
              aes(x = LD1, y = jdate, fill = year), shape = 21) + 
   labs(x = "LD1 score\n(negative is more winter-like)", y = "Sample Date") + 
   theme_classic()+geom_rect(aes(xmin = 1.58, xmax=3, ymin = -Inf, ymax = Inf), fill = "#35B77980", alpha = 0.05)+geom_rect(aes(xmin = -3, xmax=-0.337, ymin = -Inf, ymax = Inf), fill = "#FDE72580", alpha = 0.05)+ scale_fill_manual(values = c("white", "black"))+scale_shape(solid = FALSE)

“To understand if there was an increasing number of winter-like alleles over time among late-summer run fish we fit a linear mixed effects model using lme4 (Bates, Maechler & Bolker, 2012). We fit a model of julian date of sampling at river kilometer 8 to approximate freshwater entry timing, using the individual score on the first discriminant axis of the a priori DAPC as a fixed effect and sample year as a random intercept. We examined residual plots for deviations from homoscedasticity or normality and evaluated significance of the fixed effect using a likelihood ratio test.”

other

interannual variation

let’s check to make sure there are no interesting patterns / structure among halfpounder and late summer years

# create the dataset
half_genind <- seppop(genind_2.1)$halfpounder
years <- str_sub(row.names(half_genind@tab), 6,7)
half_genind$pop <- as.factor(years)

# optimized number of pcs with optim.a.score, smart = F
#half_dapc_year <- dapc(half_genind, n.pca = 200, n.da =1)
#optim.a.score(half_dapc_year, smart = FALSE)
half_dapc_year <- dapc(half_genind, n.pca = 31, n.da =1)
scatter.dapc(half_dapc_year)

Even with DAPC, very little structure year to year in halfpounders.

# create the dataset
fall_genind <- seppop(genind_2.1)$fall
years <- str_sub(row.names(fall_genind@tab), 6,7)
fall_genind$pop <- as.factor(years)

# optimized number of pcs with optim.a.score, smart = F
#fall_dapc_year <- dapc(fall_genind, n.pca = 200, n.da =1)
#optim.a.score(fall_dapc_year, smart = TRUE)
fall_dapc_year <- dapc(fall_genind, n.pca = 1, n.da =1)
scatter.dapc(fall_dapc_year)

Also very little difference between years in fall samples.

data export

#merge polarized allele count data with sample site, population year    sample date sample location genotyped based on the 9 loci in bold-face type origin

# first get the assignments
ld1_2 <- as.data.frame(dapc_full_apriori$ind.coord) %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)

# assignments using original DAPC
CIs <- ld1_2 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
assn <- ld1_2 %>%
  mutate(assignment = case_when(
    LD1 < CIs$hiCI[4]  ~ "winter", 
    LD1 > CIs$loCI[3] ~ "summer", 
    LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4] ~ "unassigned")) %>%
  select(sample, assignment)

#get hor/nor
md <- readxl::read_xlsx("./metadata/Omy Rogue2019 steelhead datasheets.xlsx", sheet = 4)

md <- md %>%
  mutate(Origin = case_when(
    Origin == "AD" ~ "HOR",
    Origin == "1" ~ "NOR",
    TRUE ~ Origin
    
  ))

to_exp <- polarized_allele_counts %>%
  rownames_to_column(var = "sample") %>%
  left_join(assn) %>%
  left_join(select(genos_2.3, Sample, Date, run , year), by = c("sample" = "Sample")) %>%
  left_join(select(md, SFGL_ID, Origin), by = c("sample" = "SFGL_ID"))

write_tsv(to_exp, "genotype_data/chr28_polarized_allele_counts_2.txt")

statistical vs physical linkage

filtered <- names(genind_2.1$loc.n.all[(!(names(genind_2.1$loc.n.all) %in% names(unlinked_genind$loc.n.all)))])

filtered <- as.data.frame(filtered) %>%
  left_join(marker_summary, by = c("filtered" = "marker"))


# let's look at the pairs

#first add names to the LD report
ldreport[ldreport$R2>0.2,]

ld_lookup_table <- as.data.frame(genind_2.1$loc.n.all) %>%
  rownames_to_column(var = "loc_name") %>%
  rowid_to_column(var = "loc_number") %>%
  select(loc_name,loc_number )

links <- ldreport[ldreport$R2>0.2] %>%
  left_join(ld_lookup_table, by = c("loc1" = "loc_number")) %>% 
  rename(loc1_name = loc_name) %>%
  left_join(ld_lookup_table, by = c("loc2" = "loc_number")) %>% 
  rename(loc2_name = loc_name) 

linked_genind <- genind_2.1[loc=unique(ldreport[ldreport$R2>0.2,]$loc2)]

Couldn’t find the results from the last time I did this, so ran it again.

  • 7 are run-timing markers on Omy28, where linkage was to other run-timing markers
  • 4 are residency markers on Omy5, where linkage was to other Omy5 markers, or the group below
  • 3 are neutral but very near residency markers, linkage was high between markers in this group or the group above
  • 1 (LDH) is associated with residency in other studies and maps to different locations, sometimes to Omy 5, tossed because of linkage to other LDH marker that has similar mapping results

So at least 15 of the 23 are likely physically linked

3 have no good mapping results 5 map to other locations, some pairs that got tossed mapped to different locations

So there’s evidence of statistical linkage without physical linkage.

As for the opposite pattern, where two markers map close together but are not strongly linked, I only looked closely at the Omy28 markers and Omy5 markers.

All of the Omy5 markers are in tight linkage, but this is not the case for Omy28 (these results are in the notebook – under section LD). The uninformative Omy28 markers are not in strong linkage with others in the region using r^2 or a similar metric r(bar)d, but both metrics are sensitive to allele frequency, and since there was little variation between groups and low maf at these markers, it’s not surprising they would have low r^2 even if they are physically linked.

HOR NOR differences within Winter

Here we check if we can discriminate between NOR and HOR winter run samples.

#assign the origin status as a population
winds <- as.data.frame(row.names(n.pop$Winter$tab))
colnames(winds) <- "sample"

hor_stat <- readxl::read_xlsx("metadata/2019_winter/StW Scales for DNA (06 17 19).xlsx", sheet = 3)
## New names:
## * `` -> ...20
winds %<>%
  left_join(select(hor_stat, 'SFGL ID', 'WorH'), by = c("sample" = "SFGL ID"))

n.pop$Winter@pop <- as.factor(winds$WorH)

dapc_hor <- dapc(n.pop$Winter, n.pca = 41, n.da = 1)
mat <- as.matrix(scaleGen(n.pop$Winter, NA.method="mean", scale=FALSE, center=FALSE))
xpop <- pop(n.pop$Winter)
xval <- xvalDapc(mat, xpop, n.pca.max = 41, training.set = 0.9, result = "overall", center = TRUE, scale = FALSE, n.pca = seq(1,41), n.rep = 50, xval.plot = TRUE)

which.max(xval$`Mean Successful Assignment by Number of PCs of PCA`)
## 1 
## 1
dapc_hor <- dapc(n.pop$Winter, n.pca = 1, n.da = 1)

scatter.dapc(dapc_hor)